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ABSTRACT 


Block  tridiagonal  systems  of  linear  equations  occur  frequently  in 
scientific  computations,  often  forming  the  core  of  more  complicated  prob- 
lems. Numerical  methods  for  solution  of  such  systems  are  studied  with 
emphasis  on  efficient  methods  for  a vector  computer.  A convergence  theory 
for  direct  methods  under  conditions  of  block  diagonal  dominance  is  developed, 
demonstrating  stability,  convergence  and  approximation  properties  of  direct 
methods.  Block  elimination  (LU  factorization)  is  linear,  cyclic  odd-even 
reduction  is  quadratic,  and  higher-order  methods  exist.  The  odd-even 
methods  are  variations  of  the  quadratic  Newton  iteration  for  the  inverse 
matrix,  and  are  the  only  quadratic  methods  within  a certain  reasonable 
class  of  algorithms.  Serai-direct  methods  based  on  the  quadratic  conver- 
gence of  odd-even  reduction  prove  useful  in  combination  with  linear  itera- 
tions for  an  approximate  solution.  An  execution  time  analysis  for  a pipe- 
line computer  is  given,  with  attention  to  storage  requirements  and  the 
effect  of  machine  constraints  on  vector  operations. 
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1.  INTRODUCTION 


Block  tridiagonal  matrices  are  a special  class  of  matrices  which  arise 
in  a variety  of  scientific  and  engineering  computations,  typically  in  the 
numerical  solution  of  differential  equations.  For  now  it  is  sufficient  to 
say  that  the  matrix  looks  like 


where  b.  is  an  n.  X n.  matrix  and  a.,  c.  are  dimensioned  conformally.  Thus 
1 11  -N1  1 N 

the  full  matrix  A has  dimension  (,  n.)  x ( n.). 

j-1  J ‘">1  J 

In  this  thesis  we  study  numerical  methods  for  solution  of  a block  tri- 
diagonal linear  system  Ax  = v,  with  emphasis  on  efficient  methods  for  a 
vector-oriented  parallel  computer  (e.g.,  CDC  STAR- 100,  Illiac  IV).  Our 
analysis  primarily  concerns  numerical  properties  of  the  algorithms,  with 
discussion  of  their  inherent  parallelism  and  areas  of  application.  For 
this  reason  many  of  our  results  also  apply  to  standard  sequential  algo- 
rithms. A unifying  analytic  theme  is  that  a wide  variety  of  direct  and 
iterative  methods  can  be  viewed  as  special  cases  of  a general  matrix  itera- 
tion, and  we  make  considerable  use  of  a convergence  theory  for  direct 
methods.  Algorithms  are  compared  using  execution  time  estimates  for  a 
simplified  model  of  the  CDC  STAR- 100  and  recommendations  are  made  on  this 


basis . 
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There  are  two  important  subclasses  of  block  tridiagonal  matrices, 
depending  on  whether  the  blocks  are  small  and  dense  or  large  and  sparse. 

A computational  method  to  solve  the  linear  system  Ax  = v should  take 
into  account  the  internal  structure  of  the  blocks  in  order  to  obtain 
storage  economy  and  a low  operation  count.  Moreover,  there  are  impor- 
tant applications  in  which  an  approximate  solution  is  satisfactory. 

Such  considerations  often  motivate  iterative  methods  for  large  sparse 
systems,  especially  when  a direct  method  would  be  far  more  expensive. 

Since  1965,  however,  there  has  been  an  increased  interest  in  the 
direct  solution  of  special  systems  derived  from  two-dimensional  elliptic 
partial  differential  equations.  With  a standard  five-point  finite  differ- 
ence approximation  on  a rectangular  grid,  the  a^,  c^  blocks  are  diagonal 
and  the  b^  blocks  are  tridiagonal,  so  A possesses  a very  regular  sparse 
structure.  By  further  specializing  the  differential  equation  more  struc- 
ture will  be  available.  Indeed,  the  seminal  paper  for  many  of  the  impor- 
tant developments  dealt  with  the  simplest  possible  elliptic  equation.  This 
was  Hockney's  work  [H5],  based  on  an  original  suggestion  by  Golub,  on  the 
use  of  Fourier  transforms  and  the  cyclic  reduction  algorithm  for  the  solu- 
tion of  Poisson's  equation  on  a rectangle.  For  an  n X n square  grid  Hockney 

3 

was  able  to  solve  Ax  * v in  0(n  ) arithmetic  operations,  while  ordinary  band- 

4 3 

matrix  methods  required  0(n  ) operations.  Other  0(n  ) methods  for  Poisson's 
equation  then  extant  had  considerably  larger  asymptotic  constants,  and  the 
new  method  proved  to  be  significantly  less  time-consuming  in  practice.  Sub- 
sequent discovery  of  the  Fast  Fourier  Transform  and  Buneman's  stable  version 

2 

of  cyclic  reduction  created  fast  (0(n  log  n)  operations)  and  accurate 


methods  that  attracted  much  attention  from  applications  programmers  and 
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numerical  analysts  [B20],  [D2],  [H6].  The  Buneman  algorithm  has  since 
been  extended  to  Poisson1 s equation  on  certain  nonrectangular  regions 
and  to  general  separable  elliptic  equations  on  a rectangle  [S6],  and  well- 
tested  Fortran  subroutine  packages  are  available  (e.g. , [S8]).  Other 
recent  work  on  these  problems  includes  the  Fourier-Toeplitz  methods  fF2 ] , 
and  Bank's  generalized  marching  algorithms  [B5].  The  latter  methods  work 
by  controlling  a fast  unstable  method,  and  great  care  must  be  taken  to 
maintain  stability. 

Despite  the  success  of  fast  direct  methods  for  specialized  problems 
there  is  still  no  completely  satisfactory  direct  method  that  applies  to  a 
general  nonseparable  elliptic  equation.  The  best  direct  method  available 
for  this  problem  is  George's  nested  dissection  [Gl],  which  is  theoreti- 
cally attractive  but  apparently  hard  to  implement.  Some  interesting  tech- 
niques to  improve  this  situation  are  discussed  by  Eisenstat,  Schultz  and 
Sherman  [El]  and  Rose  and  Whitten  [R5]  among  others.  Nevertheless,  in 
present  practice  the  nonseparable  case  is  still  frequently  solved  by  an 
iteration,  and  excellent  methods  based  on  direct  solution  of  the  separable 
case  have  appeared  (e.g.,  [C4],  [C5]).  The  multi-level  adaptive  techniques 
recently  analyzed  and  considerably  developed  by  Brandt  [B13]  are  a rather 
different  approach,  successfully  combining  iteration  and  a sequence  of  dis- 
cretization grids.  Brandt  discusses  several  methods  suitable  for  parallel 
computation  and  although  his  work  has  not  been  considered  here,  it  will 
undoubtedly  play  an  important  role  in  future  studies.  The  state-of-the-art 
for  solution  of  partial  differential  equations  on  vector  computers  is  sum- 
marized by  Ortega  and  Voigt  [01]. 

References  include  [B19],  [Dl],  [S5],  [S7],  [S9]. 
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With  the  application  area  of  partial  differential  equations  in  mind, 
the  major  research  goal  of  this  thesis  is  a general  analysis  of  direct 
block  elimination  methods,  regardless  of  the  sparsity  of  the  blocks.  As 
a practical  measure,  however,  the  direct  methods  are  best  restricted  to 
cases  where  the  blocks  are  small  enough  to  be  stored  explicitly  as  full 
matrices,  or  special  enough  to  be  handled  by  some  storage-efficient  im- 
plicit representation.  Proposed  iterative  methods  for  nonseparable  ellip- 
tic equations  will  be  based  on  the  ability  to  solve  such  systems  efficiently. 

Our  investigation  of  direct  methods  includes  the  block  LU  factoriza- 
tion, the  cyclic  reduction  algorithm  (called  odd-even  reduction  here)  and 
Sweet's  generalized  cyclic  reduction  algorithms  [S10],  along  with  varia- 
tions on  these  basic  themes.  The  methods  are  most  easily  described  as  a 
sequence  of  matrix  transformations  applied  to  A in  order  to  reduce  it  to 
block  diagonal  form.  We  show  that,  under  conditions  of  block  diagonal 
dominance,  matrix  norms  describing  the  off-diagonal  blocks  relative  to 
the  diagonal  blocks  will  not  increase,  decrease  quadratically , or  decrease 
at  higher  rates,  for  each  basic  algorithm,  respectively.  Thus  the  algo- 
rithms are  naturally  classified  by  their  convergence  properties.  For  the 
cyclic  reduction  algorithms,  quadratic  convergence  leads  to  the  defini- 
tion of  useful  semi-direct  methods  to  approximate  the  solution  of  Ax  = v. 

Some  of  our  results  in  the  area  have  previously  appeared  in  [H2]. 

All  of  the  algorithms  discussed  here  can  be  run,  with  varying  degrees 
of  efficiency,  on  a parallel  computer  that  supports  vector  operations.  By 
this  we  mean  an  array  processor,  such  as  Illiac  IV,  or  a pipeline  processor, 
such  as  the  CDC  STAR- 100  or  the  Texas  Instruments  ASC.  These  machines 
fall  within  the  single  instruction  stream-multiple  data  stream  classifica- 
tion of  parallel  computers,  and  have  sufficiently  many  common  features  to 
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make  a general  comparison  of  algorithms  pc:sible.  However,  there  are 
also  many  special  restrictions  or  capabilities  that  must  be  considered 
when  designing  programs  for  a particular  machine.  Our  comparisons  for 
a simplified  model  of  STAR  indicate  how  similar  comparisons  should  be 
made  for  other  machines.  For  a general  survey  of  parallel  methods  in 
linear  algebra  plus  background  material,  [H3]  is  recommended. 

In  almost  every  case,  a solution  method  tailored  to  a particular 
problem  on  a particular  computer  can  be  expected  to  be  superior  to  a 
general  method  naively  applied  to  that  problem  and  machine.  However, 
the  costs  of  tailoring  can  be  quite  high.  Observations  about  the  behav- 
ior of  the  general  method  can  be  used  to  guide  the  tailoring  for  a later 
solution,  and  we  believe  that  the  results  presented  here  will  allow  us 
to  move  in  that  direction. 
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B[A]  in  the  analysis  of  direct  and  semidirect  methods,  though  it  has  long 
been  used  to  analyze  iterative  methods. 

Under  the  assumption  j|  B[A]||<  1,  when  a direct  method  generates  a 
sequence  of  matrices  with  ||  B[Mi+1]||  s ||  B[M.  ]||,  we  classify  the  method 
as  linear,  and  if  ||  B[M^+^]||  £ ||  B[M^]“||  or  j|  B [M_^  ] 1 1 we  say  that  it  is 
quadratic.  The  ‘“-norm  is  used  throughout;  extension  to  other  norms  would 
be  desirable  but  proves  quite  difficult. 

The  block  LU  factorization  and  block  Gauss-Jordan  elimination  are 
shown  to  be  linear  methods,  and  their  properties  are  investigated  in  Sec- 
tion 3.  Sample  results  are  stability  of  the  block  factorization  when 
||  B [ A ] j j < 1,  bounds  on  the  condition  numbers  of  the  pivotal  blocks,  and 
convergence  to  zero  of  portions  of  the  matrix  in  the  Gauss-Jordan  algorithm. 

The  methods  of  odd-even  elimination  and  odd-even  reduction,  which  are 
inherently  parallel  algorithms  closely  related  to  the  fast  Poisson  solvers, 
are  shown  to  be  quadratic,  and  are  discussed  in  Section  4.  Odd-even  elim- 
ination is  seen  to  be  a variation  of  the  quadratic  Newton  iteration  for 
A \ and  odd-even  reduction  can  be  considered  as  a special  case  of  Hageman 
and  Varga's  general  cyclic  reduction  technique  [HI],  as  equivalent  to 
block  elimination  on  a permuted  system,  and  as  a compact  form  of  odd-even 
elimination.  We  also  discuss  certain  aspects  of  the  fast  Poisson  solvers 
as  variations  of  the  odd-even  methods. 

In  Section  5 we  show  that  the  odd-even  methods  are  the  only  quadratic 
methods  within  a certain  reasonable  class  of  algorithms.  This  class  in- 
cludes all  the  previous  methods  as  particular  cases  of  a general  matrix 
iteration.  The  relationship  between  block  fill-in  and  quadratic  conver- 
gence is  also  discussed. 
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In  Section  6 the  quadratic  methods  are  shown  to  be  special  cases  of 
families  of  higher-order  methods,  characterized  by  ||  B[M^+^]||  ^ j|  BTM^ ] 1 1 P- 
We  note  that  the  p-fold  reduction  (Sweet's  generalized  reduction  [S10])  is 
equivalent  to  a 2- fold  reduction  (odd-even)  using  a coarser  partitioning 
of  A. 

Semidirect  methods  based  on  the  quadratic  properties  of  the  odd-even 
algorithms  are  discussed  in  Section  7.  Briefly,  a semidirect  method  is  a 
direct  method  that  is  stopped  before  completion,  producing  an  approximate 
solution.  We  analyze  the  approximation  error  and  its  propagation  vis-a-vis 
rounding  error  in  the  back  substitution  of  odd-even  reduction.  Practical 
limitations  of  the  semidirect  methods  are  also  considered. 

In  Section  8 we  consider  some  iterative  methods  for  systems  arising 
from  differential  equations,  emphasizing  their  utility  for  parallel  com- 
puters. Fast  Poisson  solvers  and  general  algorithms  for  systems  with 
small  blocks  form  the  basis  for  iterations  using  the  natural,  multi-line 
and  spiral  orderings  of  a rectangular  grid.  We  also  remark  on  the  use  of 
elimination  and  the  applicability  of  the  Parallel  Gauss  iteration  [T2], 

[H4  ] . 

A brief  section  on  applications  deals  with  two  situations  in  which 
the  assumption  |(B[A]j|<  1 is  not  met,  namely  curve  fitting  and  finite  ele- 
ments, and  in  the  latter  case  we  describe  a modification  to  the  system 
Ax  = v that  will  obtain  ||b[A]||  < 1. 

Methods  for  solution  of  block  tridiagonal  systems  are  compared  in  Sec- 
tion 10  using  a simplified  model  of  a pipeline  (vector)  computer.  We  con- 
clude that  a combination  of  odd-even  reduction  and  LU  factorization  is  a 
powerful  direct  method,  and  that  a combination  of  semidirect  and  iterative 
methods  may  provide  limited  savings  if  an  approximate  solution  is  desired. 
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A general  discussion  of  storage  requirements  and  the  effect  of  machine 
constraints  on  vector  operations  is  included;  these  are  essential  con- 
siderations for  practical  use  of  the  algorithms. 

l.B.  Notation 

Let  the  block  tridiagonal  matrix 


Here  b.  is  an  n.  X n.  matrix  and  a.  = 0,  c = 0.  The  subscript  N in  the 
i l l I N 

triplex  notation  will  often  be  omitted  for  simplicity.  Define 

L[A]  = (a.,  0,  0)  , 

D[ A ] = (0,  b , 0), 

U[A]  - (0,  0,  c ), 

B[A]  = I - (D[A])_1A  = (-b"1  a , 0,  -b"1  c ) , 

C[A]  - I - A(D[A])” 1 = (-a  b“}i,  0,  -c jbj^)  . 

Use  of  these  notations  for  general  partitioned  matrices  should  be  evident; 
nonsingularity  of  D[A]  is  assumed  in  the  definitions  of  B and  C.  B[A]  is 
the  matrix  associated  with  the  block  Jacobi  linear  iteration  for  the  solu- 
tion of  Ax  = v.  It  also  assumes  an  important  role  in  the  study  of  certain 


semidirect  methods  and  will  be  central  to  our  analysis  of  direct  methods. 
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We  further  define  = diag ( 0 , . . . , 0 , I,  0,...,0),  where  the  I is 

Oj  X n^  and  occurs  in  the  jth  block  diagonal  position.  Pre-  (post-)  mul- 
tiplication by  E.  selects  the  jth  block  row  (column)  of  a matrix,  so  that, 
J N N 


E . AE . , D[A]E.  = E.U[A],  and  E.  L[A  ] E . = 0. 

. , l i j 1 t,.  . i j 

i= 1 J J j=i  J 


for  example,  D[A]  = 

For  an  nXn  matrix  M and  an  n-vector  v,  let 


Pi(M) 


y.  (M) 

II  M||  = 

II  Ml' 


n 

"j-i 

n 

'i*l 


m.  . 

1 1 


|m. . = p . (M  ) , 

J-J1  J 


maxi  * c^e  co"nonn> 


max  ^ v (M) 


M 


v||  = max.  |v.J  , 
n 


i=l 


Vi' 


|M 


( I mi j I ) » and 


0 (M)  = the  spectral  radius  of  M. 


The  reader  should  take  care  not  to  confuse  row  sums  (p^(M))  with  the  spec- 


tral radius  (o(M)).  When  A = (a.,  b.,  c .)  , let  \(A)  = max  . (4  ||b  . ^ a . ||  c- 

111  j J J J J 

The  condition  \(A)  £ 1 will  be  used  as  an  alternative  to  |j  B[A]||  < 1. 
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2.  SOME  INITIAL  REMARKS 
2 . A . Analytic  Tools 

Much  of  the  analysis  presented  in  later  sections  involves  the  matrix 
B[A].  Why  should  we  consider  this  matrix  at  all? 

First,  it  is  a reasonably  invariant  quantity  which  measures  the  diag- 
onal dominance  of  A.  That  is,  if  E is  any  nonsingular  block  diagonal  matrix 
conforming  with  the  partitioning  of  A,  let  A*  3 EA  , so  B*  = B[A*]  = I-D*  *A* 
= I - (ED)  (EA)  = I - D ^A  = B.  Thus  B[A]  is  invariant  under  block  scal- 
ings of  the  equations  Ax  = v.  On  the  other  hand,  B is  not  invariant  under 
a change  of  variables  y = Ex,  which  induces  A'  = AE  We  have 
B'  = B[A' ] = EBE  \ and  for  general  A and  E only  the  spectral  radius  3(B) 
is  invariant.  Similarly,  C[A]  measures  the  diagonal  dominance  of  A by 
columns,  and  is  invariant  under  block  diagonal  change  of  variables  but  not 
under  block  scaling  of  the  equations. 

Secondly,  one  simple  class  of  parallel  semidirect  methods  (Section  7) 
can  be  created  by  transforming  Ax  = v into  A*x  = v*,  A*  = MA,  v*  = Mv  for 
some  matrix  M,  where  ||  B[A*]||  « ||  B[A]||.  This  transformation  typically 
represents  the  initial  stages  of  some  direct  method.  An  approximate  solu- 
tion y = D[A*]  *v*  is  then  computed  with  error  x - y = B[A*]x,  so  ||B[A*]|| 
measures  the  relative  error.  Thus  we  study  the  behavior  of  B[A]  under 
various  matrix  transformations. 

Thirdly,  B[A]  is  the  matrix  which  determines  the  rate  of  convergence 
of  the  block  Jacobi  iteration  for  Ax  = v.  This  is  a natural  parallel  itera- 
tion, as  are  the  faster  semi- iterative  methods  based  on  it  when  A is  posi- 
tive definite.  It  is  known  that  various  matrix  transformations  can  affect 


the  convergence  rates  of  an  iteration,  so  we  study  this  property. 
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Finally,  B[A]  and  C[A]  arise  naturally  in  the  analysis  of  block  elim- 
ination methods  to  solve  Ax  = v.  C represents  multipliers  used  in  elimina- 
tion, and  B represents  similar  quantities  used  in  back  substitution.  In 
order  to  place  upper  bounds  on  the  size  of  these  quantities  we  consider 
the  effects  of  elimination  on  ||  B[A]||. 

Since  we  will  make  frequent  use  of  the  condition  jj  B[A]||  < 1 we  now 
discuss  some  conditions  on  A which  will  guarantee  that  it  holds. 

Lemma  2.1.  Let  J = + J?  be  order  n,  |j|  = | J^j  + | J? | , and  suppose  K 

satisfies  K = J^K  + J9  . Then  ||  j||  < 1 implies  p^CK)  £ P^J)  for  1 £ i £ n, 
and  so  ||  K ||  £ j|  j||. 

Proof . From  |j^|  + j J0 1 = |j|  we  have  p^J^)  + P^C^)  = Pj_(J)  for  1 £ i £ n. 
From  K = J^K  + J we  have  p^(K)  £ o (J^)  ||  K | j + oi(J?).  Suppose  |j  k||  s 1. 

Then  P^K)  £ C ^ ( JT^)  ||  k||  + P.^^)  II  Kll  = II  K||  s ||  J||  ||  K||>  which  implies 

||  K ||  £ ||  J j|  j|  K jj  and  1 £ ||  j||,  a contradiction.  Thus  j|  k||  < 1 and 

Pl(K)  £ O.^)  + p^J.,)  = Pt(J).  QED. 

We  note  that  if  o ^ ( J * 0 then  we  actually  have  p^K)  < p^(J)  and 

if  p.(J,)  = 0 then  p . (K)  = o.(J„)  = p,(J).  It  follows  that  if  each  row  of 
l 1 iv  t 2 i 

contains  a nonzero  element  then  ||  K||  < ||  j||.  These  conditions  are  rem- 
iniscent of  the  Ste in-Rosenberg  Theorem  ([V3],  [H8])  which  states  that  if 
J is  irreducible,  and  nonnegative,  J9  * 0 and  p(J^)  < 1 then  p(J)  < 1 
implies  p (K)  < p(J)  and  p(J)  = 1 implies  p(K)  = o(J).  There  are  also  cor- 
respondences to  the  theory  of  regular  splittings  [V3]. 


Lemma  2.2.  Suppose  J = (I  - S)K  + T is  order  n,  |j|  = | (I  - S) K | + |t|, 
P-^S)  £ P^(T)  , 1 £ i £ n.  Then  ||  j||  < 1 implies  p^K)  £ p^J)  , 1 £ i £ n, 
and  so  f|  K||  £ ||  j|j. 
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Proof.  For  1 £ i £ n,  p^T)  £ p.(T)  + p.((I  - S)K)  = P^J)  £ ||  j||  < 1, 
so  p.(T)  < 1.  Now,  1 > 0i(J)  = p.((I  - S) K)  + Pt(T)  S pt(K)  - p.(S)||K|| 

+ p.(T)  & P.(K)  + P.(T)(1  - ||  K||  ).  Suppose  ||  K||  = p^(K)  > 1.  Then 

I > pj(K)  + p (T)(l  - pj(K)),  and  we  have  1 < p^(T),  a contradiction. 

Thus  ||  K ||  ^ 1 and  pi(J)  2 Pi(K).  QED. 

If  S - T in  Lemma  2.2  then  K=»SK+  (J-  S),  |j|  = | (I  - S)K|+  |s| 

= | J - S j + | S | , so  by  Lemma  2 . 1 we  again  conclude  ||  K||  £ ||  j||. 

Similar  results  hold  for  the  1-norm. 

Lemma  2.3.  Let  J * J + J,, , |j|  3 |j  | + | J0  | , and  suppose  K satisfies 

K - KJX  + J . Then  ||  J 1^  < 1 implies  ||  K||x  ^ ||  j||x . 

Proof.  Since  JT  = + J^,  |jT|  =>  | J*  | + | J*  | , KT  = J^KT  + J,  and 

II  Ml^  = ||  MT||,  we  have  ||  JT||  = ||  < 1,  and  by  Lemma  2.1,  ||  - ||  KT|| 

* (I  JT|(  " (1  J|lr  QED- 

Lemma  2.4.  Suppose  J = K(I  - S)  + T,  | j|  = | K(T  - S)  | + |t|  , v^S)  £ y^T)  , 
1 * i £ n.  Then  ||  j||.  < 1 implies  ||  k|L  * ||  j||  . 

Proof.  We  have  v^M)  = P^fM  ),  = (I  - ST)KT  + TT,  so  by  Lemma  2.2 

||  K^||  £ ||  J^||,  and  the  result  follows.  QED. 

To  investigate  the  condition  ||  B[A]||<  1,  we  first  note  that  if  A is 
strictly  diagonally  dominant  by  rows  then  j|  B[A]||  < 1 under  any  partition- 
ing of  A.  Indeed,  if  we  take  J to  be  the  point  Jacobi  linear  iteration 
matrix  for  A,  J 3 I - diag(A)  ^A,  * D[J],  Jj  * J “ then 

||  J1||  * ||  J||  < 1,  |J|  = | JL  | + | J2|  and  B[A]  - (I  - so 

||  B[A]||  * i|  J||  < l by  use  of  Lemma  2.1. 


Now,  if  we  denote  the  partitioning  of  A by  the  vector  II  = (n^ ,n,, , . . . ,n^) 
then  the  above  remark  may  be  generalized.  Suppose  H'  = (m^ ,ra9 , . . . .m^)  rep- 
resents another  partitioning  of  A which  refines  II.  That  is,  there  are  inte- 

Pi_1 

gers  1 = pQ  < p^  < . . , < pN  = M4-1  such  that  ni  = m..  If  B[A ] and 

1 ~ J = P i- 1 J 

B'[A]  are  defined  by  the  two  partitions  and  j|  B'[A]||  < 1 then  [|  B [ A ] 1 1 
£ ||  B'[A]||.  This  is  an  instance  of  the  more  general  rule  for  iterative 
methods  that  "more  implicitness  means  faster  convergence"  (cf.  [H8,  p.  107]) 
and  will  be  useful  in  Section  6.  Briefly,  it  shows  that  one  simple  way  to 
decrease  ||  B[A]||  is  to  use  a coarser  partition. 

Let  us  consider  a simple  example  to  illustrate  these  ideas.  Using 

2 

finite  differences,  the  differential  equation  -u  - u + Xu  = f on  T 0 , 1 ] , 

n xx  yy  l > j > 

X > 0,  with  Dirichlet  boundary  conditions,  gives  rise  to  the  matrix 

9 

A = (-1,  M,  -I),  M = (-1,  4+Xh",  -1).  By  Lemma  2.1,  ||  B[A]||  £ ||  J[A]||  = 

2 

4/ (4+Xh  ) < 1.  Actually,  since  B[A]  is  easily  determined  we  can  give  the 
better  estimate  ||  B[A]||  = 2 1 1 M ^ ||  < 2/  (2+Xh~)  . 

Indeed,  many  finite  difference  approximations  to  elliptic  boundary 
value  problems  will  produce  matrices  A such  that  ||  B[A]||  < 1 [ V3  ] ; |{  B[A]|| 
will  frequently  be  quite  close  to  1.  This  condition  need  not  hold  for 
certain  finite  element  approximations,  so  a different  analysis  will  be 
needed  in  those  cases.  Varah  [Vl],  [V2]  has  examples  of  this  type  of 
analysis  which  are  closely  related  to  the  original  continuous  problem.  In 
Section  9.B  we  consider  another  approach  based  on  a change  of  variable. 

||  B[ A ] 1 1 < l is  a weaker  condition  than  strict  block  diagonal  dominance 
relative  to  the  “-norm,  which  for  block  tridiagonal  matrices  is  the  condi- 
tion ([FI],  [Vl])  ||  b j 1 1|  (||  a j ||  + ||  c j 1 1 ) < 1,  1 * j s u.  The  general 
definition  of  block  diagonal  dominance  allows  the  use  of  different  norms. 


so  for  completeness  we  should  also  consider  theorems  based  on  this 
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assumption.  However,  in  order  to  keep  the  discussion  reasonably  concise 
this  will  not  always  be  done.  It  is  sometimes  stated  that  block  diag- 
onal dominance  is  the  right  assumption  to  use  when  analyzing  block  elimi- 
nation, but  we  have  found  it  more  convenient  and  often  more  general  to 
deal  with  the  assumption  ||  B[A]||<  1. 

Other  concepts  related  to  diagonal  dominance  are  G-  and  H-matrices . 

If  J = I - diag(A)  ^A=  L + U,  where  L (U)  is  strictly  lower  (upper)  trian- 
gular, then  A is  strictly  diagonally  dominant  if  ||  j||  < 1,  a G-matrix  if 
||  (I  - j L | ) 1 |U|  ||  < 1,  and  an  H-matrix  if  p(|j|)  < 1.  Ostrowski  [02] 
shows  that  A is  an  H-matrix  iff  there  exists  a nonsingular  diagonal  matrix 
E such  that  AE  is  strictly  diagonally  dominant.  Varga  [V4]  summarizes 
recent  results  on  H-matrices.  Robert  [R3]  generalizes  these  definitions 
to  block  diagonally  dominant,  block  G-  and  block  H-matrices  using  vector- 
ial and  matricial  norms,  and  proves  convergence  of  the  classical  iterative 
methods  for  Ax  = v.  Again,  for  simplicity  we  will  not  be  completely  gen- 
eral and  will  only  consider  some  representative  results. 

2 .B . Models  of  Parallel  Computation 

In  order  to  read  Sections  3-9  only  a very  loose  conception  of  a vector- 
oriented  parallel  computer  is  necessary.  A more  detailed  description  is 
needed  for  the  execution  time  comparison  of  methods  in  Section  10.  This 
subsection  contains  sufficient  information  for  the  initial  part  of  the 
text,  while  additional  details  may  be  found  in  Appendix  A,  [H3],  [S3],  and 
the  references  given  below. 

Parallel  computers  fall  into  several  general  classes,  of  which  we  con- 
sider the  vector  computers,  namely  those  operating  in  a synchronous  manner 


and  capable  of  performing  efficient  floating  point  vector  operations. 
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Examples  are  the  array  processor  Illiac  IV,  with  64  parallel  processing 
elements  [B6],  [B12];  the  Cray-1,  with  eight  64-word  vector  registers  [C8]; 
and  the  pipeline  processors  CDC  STAR- 100  [C2]  and  Texas  Instruments  ASC 
[Wl],  which  can  handle  vectors  of  (essentially)  arbitrary  length.  These 
machines  have  become  important  tools  for  large  scientific  computations, 
and  there  is  much  interest  in  algorithms  able  to  make  good  use  of  special 
machine  capabilities.  Details  of  operation  vary  widely,  but  there  are  suf- 
ficiently many  common  features  to  make  a general  comparison  of  algorithms 
poss ible . 

An  important  parallel  computer  not  being  considered  here  is  the  asyn- 
chronous Carnegie-Mellon  University  C.mmp  [W4],  with  up  to  16  minicomputer 
processors.  Baudet  [B7]  discusses  the  theory  of  asynchronous  iterations 
for  linear  and  nonlinear  systems  of  equations,  plus  results  of  tests  on 
C . mmp . 

Instructions  for  our  model  of  a vector  computer  consist  of  the  usual 
scalar  operations  and  a special  set  of  vector  operations,  some  of  which 
will  be  described  shortly.  We  shall  consider  a conceptual  vector  to  be 
simply  an  ordered  set  of  storage  locations,  with  no  particular  regard  to 
the  details  of  storage.  A machine  vector  is  a more  specialized  set  of 
locations  valid  for  use  in  a vector  instruction,  usually  consisting  of 
locations  in  arithmetic  progression.  The  CDC  STAR  restricts  the  progres- 
sion to  have  increment  = 1,  which  often  forces  programs  to  perform  extra 
data  manipulation  in  order  to  organize  conceptual  vectors  into  machine 
vectors . 

We  denote  vector  operations  by  a parenthesized  list  of  indices.  Thus 


the  vector  sum  w = x+y  is 
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w.  = x.  + y . , (1  s i s N) 

1 t i 

, N 

if  the  vectors  are  in  R , the  componentwise  product  would  be 


Wi  = Xi  X yi’  C1  s ^ N>  » 


and  a matrix  multiplication  C = AB  would  be 


Cij=_^1aik  v (1  "N;  1 sn)- 

Execution  time  for  a vector  operation  depends  on  the  length  of  the  vector 

and  the  internal  mechanism  of  the  vector  computer,  but  can  usually  be 

written  in  the  form  T N + c if  the  vector  length  is  N.  On  a pipeline 

op  op  r r 

computer,  1/ t is  the  asymptotic  result  rate  and  a is  the  vector  instruc- 
op  op 

tion  startup  cost.  On  a k-parallel  computer  (one  with  either  k parallel 

processors  or  k-word  vector  registers)  the  basic  time  would  be  t for  k 

op 

operations  in  parallel  and  the  vector  instruction  cost  t N k~  + overhead. 

op 

In  either  case  the  tN  + a model  is  adequate  for  vector  operations;  scalar 


operation  times  are  given  as  t 


op 


Simplified  instruction  execution  times 


for  the  CDC  STAR- 100  are  given  below,  in  terms  of  the  40  nanosecond  cycle 


time.  Usually  T « t « a 

op  op  op 


operation 

scalar  time 

vector 

time 

add,  subtract 

13 

b* 

71 

multiply 

17 

N + 

159 

divide 

46 

2N  + 

167 

square  root 

74 

2N  + 

155 

transmit 

11 

91 

maximum 

- 

6N  + 

95 

summation 

- 

6.5N 

+ 122 

inner  product 

- 

6N  + 

130 
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The  use  01  .ong  vectors  is  essential  for  effective  use  of  a vector 
computer,  in  order  to  take  full  advantage  of  increased  result  rates  in 
the  vector  operations.  The  rather  large  vector  startup  costs  create  a 
severe  penalty  for  operations  on  short  vectors,  and  one  must  be  careful 
to  design  algorithms  which  keep  startup  costs  to  a minimum.  Scalar  opera- 
tions can  also  be  a source  of  inefficiency  even  when  most  of  the  code  con- 
sists of  vector  operations.  Data  manipulation  is  often  an  essential  con- 
sideration, especially  when  there  are  strong  restrictions  on  machine  vec- 
tors. These  issues  and  others  are  illustrated  in  our  comparison  of  algo- 
rithms in  Section  10,  which  is  based  on  information  for  the  CDC  STAR- 100. 
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3.  LINEAR  METHODS 


The  simplest  approach  to  direct  solution  of  a block  tridiagonal  linear 
system  Ax  = v is  block  elimination,  which  effectively  computes  an  LU  factor- 
ization of  A.  We  discuss  the  process  in  terms  of  block  elimination  on  a 
general  partitioned  matrix  under  the  assumption  j|  B[A]|j<  1;  the  process  is 
shown  to  be  linear  and  stable.  Various  properties  of  block  elimination  are 
considered,  such  as  pivoting  requirements,  bounds  on  the  condition  numbers 
of  the  pivotal  blocks,  error  propagation  in  the  back  substitution,  and 
special  techniques  for  special  systems.  Gauss- Jordan  elimination  is  studied 
for  its  numerical  properties,  which  foreshadow  our  analysis  of  the  more  impor- 
tant odd-even  elimination  and  reduction  algorithms. 

3 .A.  The  LU  Factorization 
3.A.I.  Block  Elimination 

The  block  tridiagonal  LU  factorization 


A = (aj,b.,c.)N  = (i.,I,0)N(0,d.,c.)N 


is  computed  by 


d = b 
1 1 


,-l 


dj  = bj  - ajdj-l  cj-l>  J = 2.....N, 

l3  = Vj-l’  j = 


where  the  existence  of  d^.  ^ is  assumed  and  will  be  demonstrated  when  ||  B [ A ] 1 1 < 1. 
For  the  moment  we  are  not  concerned  with  the  particular  methods  used  to  compute 
or  avoid  computing  either  Z ^ or  d^ ^ cj-l" 
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Let  A^  = A and  consider  the  matrix  A0  which  is  obtained  after  the  first 
block  elimination  step.  We  have  A0  = (a.,3.,cJN>  3^  = d^  = b^,  a ^ = 0, 

32  = d2’  aj  = aj  and  = bj  for  3 s Let  Bx  = B[A1],  B2  = B[A,]. 

Note  that  and  differ  only  in  the  second  block  row;  more  specifically, 
in  the  (2,1)  and  (2,3)  positions. 

Theorem  3.1.  If  J|  B ^ ||  < 1 then  ||  B2  ||  £ ||  B ||  . 

Proof.  Let  T = B^E^  and  let  S = TB^.  We  have  d2  = b9(I  - b0^a0b^c  ), 

II  b2  la2bllcl  II  = H SH  ^ II  B1 II  II  Ei  II  II  Bill  = II  Bil|2  < 1>  so  d21  exists  and  B„ 
is  well-defined.  Also,  B^  = (I  - S)B,  + T,  { B^ | = j (I  - S) B, | + j T | by  con- 
struction, and  o^(S)  £ 0 j (T)  1 1 B^|[  ^ P^(T).  Thus  Lemma  2.2  applies,  yielding 


QED 


Now,  for  any  partitioned  matrix  the  N stages  of  block  elimination  (no 
block  pivoting)  can  be  described  by  the  process 


for  i = 1 N-  1 

Ai+1  - (I  ' L[Ai]EiD[A1]‘1)Ai 
= (I  + L[C[Ai]]Ei)A.. 

We  transform  A^  into  by  eliminating  all  the  blocks  of  column  i below  the 

diagonal;  when  A is  block  tridiagonal  A^  is  the  upper  block  triangular  matrix 
(0,dj,Cj)^.  The  same  effect  can  be  had  by  eliminating  the  blocks  individually 
by  the  process 
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Aj  = A 

for  i = 1, . . . ,N- 1 

Af1!  = A 
l+l  l 

for  j = i+l,...,N 

A^j)  = (I  - E.Af-^E.DCA^"^  ]'1)A^"1) 
l+l  j i+l  l l+l  J l+l 


L = (I  + EjC[Ai+r  ]Ei 


i+l 


L 


A . . = a<N> 

i+l  i+l 


Each  minor  stage  affects  only  block  row  j of  the  matrix.  It  is  seen  that 

block  row  j of  A^~  ^ = block  row  j of  A., 
i+l  J l 

block  row  i of  Ap,1"*  = block  row  i of  A.. 

i+l  l 

from  which  it  follows  that 

E.A^YofAy-  Y1  = E.A.E.D[A.]-1, 
j i+l  l i+l  J j l l L iJ  ’ 

so  the  major  stages  indeed  generate  the  same  sequence  of  matrices  A^.  Let 
B.  = B[A.]. 

Corollary  3.1,  If  A is  block  tridiagonal  and  j|  B^||  < 1 then  ||  B.+1||  ^ |j  B.  ||, 
1 £ i SN-1, 


The  proof  of  Corollary  3.1  requires  a bit  more  care  than  the  proof  of  Theorem 
3.1  would  seem  to  indicate;  this  difficulty  will  be  resolved  in  Theorem  3.2. 
We  should  also  note  the  trivial  "error  relations" 


*i+i  - 'Hill  s ll*i  - *JI. 
Bi+1-  'nII  sliBi-  BJ- 
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These  follow  from  the  fact  that,  for  block  tridiagonal  matrices, 

i N 


A. 

1 


E . 


A 


+ 


~j=i+l 


EjAr 


, which  holds  regardless  of  the  value  of  j|  B ^||  . 

Wilkinson  [W3]  has  given  a result  for  Gaussian  elimination  that  is  similar 
to  Corollary  3.1,  namely  that  if  A is  strictly  diagonally  dominant  by  columns 
then  the  same  is  true  for  all  reduced  matrices  and  ||  A^+^  (reduced)  ||^  £ 

||  A^  (reduced)  j|^.  This  actually  applies  to  arbitrary  matrices  and  not  just 
the  block  tridiagonal  case.  Brenner  [B14]  anticipated  the  result  but  in 
a different  context. 

Our  next  two  theorems,  generalizing  Corollary  3.1  and  Wilkinson's  observa- 
tion, lead  to  the  characterization  of  block  elimination  as  a norm- non- increas ing 

k 

projection  method  , and  proves  by  bounding  the  growth  factors  that  block 
elimination  is  stable  when  ))b[A]||  < 1.  In  addition,  many  special  schemes  for 
block  tridiagonal  matrices  are  equivalent  to  block  elimination  on  a permuted 
matrix,  so  the  result  will  be  useful  in  some  immediate  applications. 

Theorem  3.2.  If  )| B 1||  < 1 then  ||  B.+1||  <;  ||  B.|]  and  j|  At+1||  ^ ||  A.||  for 
1 £ i S8-1. 


Proof.  Suppose  ||  B^||<  1.  We  have 

A1+1  3 (I  - L[A.]EiD[Aif1)Ai 

- A.  - L[A.]E.  + L[A1]E1B1, 

PJ(Ai+l>*  Pj(Ai  ' + Pj(L[Ai]EiB.) 

s P j(Ai  - L[A.]E.)  + oj(L[Ai]Ei)||  B.|| 

k 

The  projections  are  onto  spaces  of  matrices  with  zeros  in  appropriate  positions. 


* Pj(A.  - L[A.]E.)  + PjCLCA.JE.) 
= P.(A.), 


S°  II  Ai+lH  * II  Ai II  • Now  let 

Q = D[Ai]‘1L[Ai]EiD[Ai]'1A.> 

30  Ai+1  = Ai  ' °fAi^»  D[Ai+l]  = D^AiKI  - D[Q ])  • 

Let  S = D[ Q ] - Since 


Q = (-L[B.]E.)(I  - B.)  = -L[B.]  + L[B.]E.B., 

S = D[L[B.]E.B.]  and  ||s||*  ||L[B.]||  ||e.||  ||B.||s  ||B.j|2  < 1.  Thus 
D[A^+^]  exists  and  B^+^  is  well-defined.  It  may  be  verified  that 
(I  ' = Bi  - S + Q.  Let  T = Bl  - S + Q,  J = S + T = B + Q.  Since 

D[ Bjl ] = D[Q  - S]  = 0,  we  have  D[T]  = 0 and  |j|  = |s|  + |t|.  Thus,  by  Lemma  2.1, 
j|  j||  < 1 implies  ||  Bi+1||  ^ j|  j||  . But  we  have,  after  some  algebraic  manipula- 
tion. 


J = B.  + Q 

i 


Ci  EA 

yN 

+ / E.B,  (I  - E.  + E.B.) , 

Tc-i+1  k 1 1 11 


and  it  follows  from  this  expression  that  ||  j||  £ ||  B^||  . QED 

Although  expressed  in  terms  of  the  N- 1 major  stages  of  block  elimination, 
it  is  clear  that  we  also  have,  for  B^  = B[A^  ],  ||  B^  ||  £ ||  B^"1^  ||  and 

llAij)||  5 ||  X)  ||.  Moreover,  as  long  as  blocks  below  the  diagonal  are 
eliminated  such  inequalities  will  hold  regardless  of  the  order  of  elimination. 
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We  shall  adopt  the  name  "linear  method"  for  any  process  which  generates 
a sequence  of  matrices  such  that  ||  B[M^+^][|  s ||  B[M^]||,  given  that 
||  B[M^]||<  1.  Thus  block  elimination  is  a linear  method. 

A result  that  is  dual  to  Theorem  3.2  is  the  following. 


Definition.  Let  CL~  be  the  reduced  matrix  consisting  of  the  last  N-i+1  block 
rows  and  columns  of  A^.  = A^  = A) . 

Theorem  3.3.  If  ||  CfA]^  < 1 then  ||  C[<2.+1  ] ||  l * ||  C[£2.  ] ||r  \\^i+l\\l  ^ H^llj. 
and  ||  L[C1]E.||1  * ||  CCA]^,  C.  = C[A.]. 

Remark.  It  is  not  true  in  general  that  ||c  ||  £ | j c ^ 1 1 f HCilli<  1’ 


Proof.  Let  the  blocks  of  C . be  c , 1 £ ra,  p £ N.  Then  C\&.  1 = (c  ),-<s„ 
i mp  L mp 


mp 


Let  K = (cmp  + craicip)  i+l^n,p^ 


= ( r ) , , •+■ 

' mp  i+ls<n,p^I 


(ci,i+l  CiN)- 


Using  Lemma  2 .4 , 
But 


^i+iHl! 


if  : k L < l. 


VK>*V  E C6J?  ]) 

4 H ~k=i+l 


i+l»i\ 


'Ni 


l^qCEf  C82,]) 


+ II  cBS1]||1vq(EiCK?i]) 

s v-C1+i E* 


r 
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+ vq(Et  CE^]) 

= vq(C52.]). 

Thus  II  k||x  ^ ||c[<2i]||1<  1 and  ||  ct^i+1]  II  ± 5 ||  C^]  ||  1 . By  construction 

this  implies  ||  £ ||  C [ A ] 1 1 ^ for  1 £ i £ N.  Now,  let  the  blocks  of 

be  a , 1 ^ m,  p s n.  Then  CL.  , = (a  + c .a.  and 

mp  l+l  mp  mi  ip  i+l^n,p^l 

s V^+1  Etf> 


llVq(Ei°l) 


' X,  «V  ■ 

QED 


We  note  that  if  A is  symmetric  then  C[A]  = B[A ] and  Theorems  3.2  and  3.3 
become  nearly  identical.  Of  course,  B and  C are  always  similar  since  C =*  DBD 
When  A is  positive  definite,  Cline  [ C 3 ] has  shown  that  the  eigenvalues  of  42 
interlace  those  of  42^  Actually,  the  proof  depends  on  use  of  the  Cholesky 
factorization,  but  the  reduced  matrices  are  the  same  in  both  methods.  In 
consequence  of  Cline's  result  we  have  ||^Z  ||  5 • 

The  statements  of  Theorems  3.2  and  3.3  depend  heavily  on  the  particular 


norms  that  were  used.  It  is  not  true  in  general,  for  example,  that  if 
||B[A1]||1<  1 then  ||  B[A2]  ||j  £ ||  B[Aj]  ||j  or  even  ||  B[£,  ] ^ s ||  B[4^ ] ||r  This 
makes  the  extension  to  other  norms  quite  difficult  to  determine. 
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3. A. 2.  Further  Properties  of  Block  Elimination 

It  is  well-known  that  the  block  factorization  A - LAU  is  given  by 
N 

L - I - ; L[c  ]E  , 

S-i  J J 

A = D[A^1, 

N 

U = I - B =*  I - / E.  U[B  ]. 

N “j-l  J J 

We  have  therefore  shown  in  Theorems  3.2  and  3.3  that  if  ||  1 then 

II  L||x  * 1 + II  C1||  J_  <2,  ||  L||  * 1 + Nn||  C1||1,  n = maXj  n . , and  if  ||  B1||  < 1 
then  ||  U||  * 1 + ||  B x j|  <2,  ||  u||  l £ 1 + Nn||  bJI  . (For  block  tridiagonal 
matrices  the  factor  N can  be  removed,  so  |j  L||  £ 1 + nj|  C ||^,  ||  £ 1 + n||  B^j|.) 

Moreover,  the  factorization  can  be  completed  without  permuting  the  block  rows 
of  A.  The  purpose  of  block  partial  pivoting  for  arbitrary  matrices  is  to 
force  the  inequality  ||  L[ C ^ ]E ^ 1 1 £ 1.  Broyden  [B15]  contains  a number  of 
interesting  results  related  to  pivoting  in  Gaussian  elimination.  In  particular 
it  is  shown  that  partial  pivoting  has  the  effect  of  bounding  the  condition 
number  of  L,  though  the  bound  and  condition  number  can  actually  be  quite 
large  in  the  general  case.  We  will  have  some  more  comments  on  pivoting  after 
looking  at  some  matrix  properties  that  are  invariant  under  block  elimination 
(cf.  [311]). 


Corollary  3.2.  Block  elimination  preserves  strict  diagonal  dominance. 


Proof . It  follows  from  Theorem  3.2  that  Gaussian  elimination  preserves  strict 
diagonal  dominance:  simply  partition  A into  lxl  blocks.  Now  suppose  we  have 
A^  and  are  about  to  generate  A This  can  be  done  either  by  one  stage  of 

block  elimination  or  by  n^  stages  of  Gaussian  elimination,  for  both  processes 
produce  the  same  result  when  using  exact  arithmetic  [HS,  p.  130].  Thus  if  A^ 
is  strictly  diagonally  dominant,  then  so  is  QED 


r 
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It  is  shown  in  [B4]  that  Gaussian  elimination  maps  G-matrices  into 
G-matrices.  This  can  be  extended  in  an  obvious  fashion  to  the  case  of  block 
elimination.  We  also  have 

Corollary  3.3.  Block  elimination  maps  H-matrices  into  H-matrices. 

Proof.  Recall  that  A is  an  H-matrix  if  and  only  if  there  exists  a nonsingular 
diagonal  matrix  E such  that  A'  = AE  is  strictly  diagonally  dominant.  Con- 
sider the  sequence 


A^  = A' , 

A^+1  = (I  - L[A'.]E.  D[A’]-1)A^. 

Clearly  A^  = AjE;  suppose  that  A^  = A^E  is  strictly  diagonally  dominant.  We 
then  have 

Ai+1  = (I  " L[Ai]E  EiE“1D[Aif1)A1E 
= (I  - L[At]E.  DrA.f^A.E 

= Ai+1E’ 

so  A^+^  is  an  H-matrix  since  A^_^  is  strictly  diagonally  dominant  by  Corollary 
3.2.  QED 

We  have  chosen  to  give  this  more  concrete  proof  of  Corollary  3.3  because 
it  allows  us  to  estimate  another  norm  of  B[A]  when  A is  an  H-matrix.  Define 
I|m|!e=  ||  E ^ME 1 1.  Then  since  A'  = AE  implies  B[A' ] = E ^B[A]E  it  follows 
that  if  A is  an  H-matrix  then  ||  BCAi+1]||E  * ||  BCA^H  . Since  the  matrix  E 
will  generally  remain  unknown  this  is  in  the  nature  of  an  existential  proof. 

The  proof  of  Corollary  3.2  brings  out  an  important  interpretation  of 
Theorem  3.2.  If  block  elimination  is  actually  performed  then  our  result 


gives  direct  information  about  properties  of  the  reduced  matrices.  If  not, 
and  ordinary  Gaussian  elimination  is  used  instead,  then  Theorem  3.2  pro- 
vides information  about  the  process  after  each  n^  steps.  While  we  are  not 
assured  that  pivoting  will  not  be  necessary,  we  do  know  that  pivot  selec- 
tions for  these  n.  steps  will  be  limited  to  elements  within  the  d.  block 
for  the  block  tridiagonal  LU  factorization.  When  A has  been  partitioned  to 
deal  with  mass  storage  hierarchies  this  will  be  important  knowledge  (see 
Section  10. A). 

One  more  consequence  of  Theorem  3.2  is 


Corollary  3.4.  In  the  block  tridiagonal  LU  factorization,  if  Jj  B^|j  < 1 then 
cond(dj)  = ||  d ||  ||  d" 1 1|  < 2 cond(b.)  / (1  - ||  B1||2  ). 

Proof.  Letting  a , = b . *a  .d . \c  . . , we  have  d.  = b.(I  - a.),  ||a\||  £ 

J J J J-l  J-l  J J j 11  j" 

II  Bx||  ||  Bn||  * ||  Bjf  < 1,  so  that  ||  d ||  s ||  b ||  (1  + ||  o ||  ) < 2 ||  b ||,  and 


* b 


(I-  a.)"1!!*  Hb-.1!!/  (1-  Hcjll)  ^ ||  b”1 1|  / (1-  Hb^I2). 


It  is  important  to  have  bounds  on  cond(d^)  since  systems  involving  d^ 

must  be  solved  during  the  factorization.  Of  course,  a posteriori  bounds  on 

||  d.^||  may  be  computed  at  a moderate  cost.  We  note  also  that 

cond(A)  ^ maXj  2 cond(b^)  / (1  - ||  B^H  ) when  ||  B^||  < 1,  and  it  is  entirely 

possible  for  the  individual  blocks  b.  and  d.  to  be  better-conditioned  than  A. 

J j 

The  inequalities  in  Theorem  3.2  are  the  best  possible  under  the  condition 
||  B^ll  < 1.  This  is  because,  for  example,  the  first  block  rows  of  B^  and  B? 
are  identical.  However,  it  is  possible  to  derive  stronger  results  by  using 
different  assumptions  about  B.. . 


V 
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Theorem  3 .4  . [H4  ] For  A = (a^  ,b  ,c  ^)  , let  \(A)  = max^  (4  ||  *a^  ||  ||b^  ^c  ^ ||  ) , 

e j = b'V.  If  X(A)  S 1 then  ||  I - e j ||  £ (1  - Jw\)/2  and  ||  1 1|  <:  2/(1  + v^) . 

Some  useful  consequences  of  this  theorem  include  better  estimates  for 

II  B.,  ||  and  a smaller  upper  bound  on  the  condition  numbers  of  the  d.'s.  For  we 
N J 

have  B^  = (0,0,-dj^cJ  = (0,0, -e^b^  ^c^)  , which  implies  that 

||  B^jU  ^ 2 ||  U[B^ ] ||  / (1  + v/1- X)  . As  an  example , if  A = (1,3,1)  then  Theorem 

3.2  predicts  only  |)  B^H  ^ ||  B^||  = 2/3,  while  by  Theorem  3.4  we  obtain 

||  Bj|  £ 2(1/3)  / (1  + ,/l-4/9)  * 0.382.  Also,  ||  d“  1 1|  £ ||  d~  Lb  ^ ||  ||b~  1 1|  = 

l|a'j1||  l|b j 1 II  S -||b j 1 ||»  and  ||d.  ||  * |^-  d . ||  + |^  ||  - INjd  “ -^  ||  + |^  ||  ^ 

ll  b j ||  ||  I - e ^ |j  + ||  b.  j|  £ 3 ||  b^l/2.  It  follows  that  cond(dj)  = ||  d"  X|1  ||d  j ||  s 

3 cond(bj)  . More  precisely,  cond(dj)  ^ (1  + 2 u ( A.)  ) cond(bJ  , u * ( 1 - • '1-  \)/(  1 + »/l-  X)  . 

This  is  further  proof  of  the  stability  of  the  block  LU  factorization. 

Theorem  3.4  has  been  used  in  the  analysis  of  an  iteration  to  compute  the 
block  diagonal  matrix  (0,d^,0)^  [ H4 ] . Such  methods  are  potentially  useful 
in  the  treatment  of  time- dependent  problems,  where  good  initial  approximations 
are  available.  This  particular  iteration  is  designed  for  use  on  a parallel 
computer  and  has  several  remarkable  properties;  for  details,  see  [H4]  and 
Section  8.E. 

Two  results  on  the  block  LU  factorization  for  block  tridiagonal  matrices 
were  previously  obtained  by  Varah. 

Theorem  3.5.  [ Vl ] If  A is  block  diagonally  dominant  (||  b^||(||a^||+  | (c ^ 1 1)  s 1) 
and  ||  Cj||  J*  0,  then  ||  d~ 1 1|  * ||  c^f1,  ||  ||  s ||  ||  / ||  and 

11^11*11^11+  II  aj  II* 
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In  comparison  to  Theorem  3.2,  the  bound  on  ||  d^||  corresponds  to  the 

bound  ||  < w^ich  follows  from  ||  B^U  < 1,  and  the  bound  on  ||  d^  || 

corresponds  to  the  bound  ||a^||  ^ ||a||.  In  contrast  to  Theorem  3.2,  j|  B ||  < 1 

“1  "1  2 

only  allows  us  to  conclude  ||  d^  ||  5 ||  b^  ||  / (1  - j|  B^||  ),  and  we  can  only 
estimate  ||  ^ ||  in  the  weak  form  ||  ijcj_1||  * ||  a.  ||  ||  dj^c^JI  5 ||  « ||. 

Varah's  second  result  is 


Theorem  3.6.  [Vl]  Let  a.  = (||  bj^ajll  II  ^j\Cj  ill  2 £ j £ N,  and  sup- 

pose a.  P 0.  Then  the  block  LU  factorization  of  A is  numerically  stable 
J 

(i.e.,  there  exists  a constant  K(A)  such  that  max(||  l ||  , ||  d^  ||  ) £ K(A))  if 
(cty  1 , o' j is  positive  semidef  inite. 

We  note  that  (<*^,1,3.  is  a symmetrization  of  ( 1 1 b . ^a  ^ ||,  1 , | j b ^ ^c  ^ 1 1 ) 
and  that  \(A)  £ 1 implies  that  it  is  positive  definite.  The  actual  bounds 
on  ||  2^  ||,  ||  d j 1 1 given  by  Varah  [VI]  for  this  case  will  not  be  important  for 

this  discussion. 

Since  many  block  tridiagonal  systems  arise  from  the  discretization  of 
differential  equations,  a number  of  results  have  been  obtained  under  assump- 
tions closely  related  to  the  original  continuous  problem.  Rosser  [R7]  shows 
that  if  the  system  Ax  = v,  A = (aj,bj,Cj),  satisfies  a certain  maximum 
principle  then  A is  nonsingular,  the  LU  factorization  exists  (each  d^  is  non- 
singular) , B^  is  non-negative  and  ||b^  ||  S 1.  Improved  bounds  on  B^  are 
obtained  when  A is  defined  by  the  nine-point  difference  approximation  to 
Poisson's  equation.  In  particular  there  is  the  remarkable  observation  that 
the  spectral  condition  number  of  d^  is  less  than  17/3.  Such  results  depend 
strongly  on  the  particular  matrix  under  consideration.  For  examples  of 
related  work,  see  [B18],  [Cl],  [D2]. 
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3. A. 3.  Back  Substitution 

So  far  we  have  only  considered  the  elimination  phase  of  the  solution  of 
Ax  = v.  For  block  tridiagonal  matrices  A = (a^.b^.c^^  this  consists  of 
the  process 

di = V fi = v 

d.  = b - a.d.^.c.  f.  = v.  - a.d.^.f. 

J j J 3-1  3-1  3 3 3 3-1  j-1* 

3 = 2 , . . • ,N. 

The  back  substitution  phase  consists  of  the  process 


XN  = dN  fN’ 

X3  = dj1(f3  " CjXJ+l)*  j = 


In  matrix  terms  these  are  the  processes 


Al  - A,  f(1)  = v, 

A.+1  - (I  + L[C.]E.)A.  , i =*  1,...,N-1, 

f(L+1)  = (I  + L[C.]E.)f(L)  , i = 1,...,N-1, 

(N)  „r»  n-lr(N> 

g = x D[A^]  fv  , 


x(l)  =»  g + EjB  x(l+1),  i =*  N- 1 1 , 


X 53  X 


(1) 


An  alternate  form  of  back  substitution  is 


x(N)  - D[AN]’1f(N) 


.(1) 


X 3 X 


(I  + BNE.  + 1)x(i+1),  i = N- 1 , . . . , 1 , 


(1) 
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In  either  case  it  is  seen  that  plays  an  essential  role  in  studying  the 
back-substitution  process. 

For  those  situations  where  a band  elimination  method  for  Ax  = v might  be 

preferred,  the  matrix  maintains  its  importance  for  back  substitution.  We 

have  A = (a.,b.,c.)  = (a . , l . ,0)  (0,u . ,c  .)  , d.  = i.u.,  a.  = a .u . _,  c.  = l,  c., 
JJJ  J J JJJ  JJJ  J J-l  J JJ 

A A 

where  .^(u^)  is  lower  (upper)  triangular.  The  elimination  phase  generates 

* * 1 * - 1 

vectors  f.  = 2.f.,  and  the  back  substitution  computes  x^  = u^  f^  = f , 

A-l  * A . -1 

x.  = u.  (f.  - c.x.,,)  = d.  (f.  - c.x.,..).  Except  for  the  specific  form  of 
J J V J J J+l  J J J J+l 

roundoff  errors  the  band  and  block  back  substitutions  must  behave  identically. 
Let  us  briefly  consider  the  effect  of  rounding  errors  in  back  substitu- 


tion; the  computed  solution  will  be  y . = x . + * and  we  require  bounds  on  f . 
Suppose  the  computed  forward  elimination  yields  d^  = d^  + 5 , f^  = f^  -4-  cp ^ . 

JU 

Let  7).  represent  the  errors  introduced  in  forming  f.  - c.y.  ,,  so  that  y. 

J J J J+l  J 

* (2)  * 

satisfies  (d . + 6.  )y.  = f.  - c.y.  , + T|.  if  ordinary  Gaussian  elimination 

J J J J J J+l  J 

* 

is  used  to  solve  for  y.  in  d.  y.  = f.  - c.y.,,  + T|..  It  follows  that 

J J 'J  J J J+l  J 

v.  = d.^(f.  - c.y.,,)  + e.,  where  e.  = d.^(®.  + 1.  - (6^  + 6^  )y.),  which 
J J J J J+l  j j J J j J j j 

implies  = - djlcj?j+l-  Thus  H Sjll  5 II  BNH  II  ?j+lH+  II  ejH»  and  the 

previous  errors  tend  to  be  damped  out  in  later  computations  if  ||b^||<  1.  If 

we  have  a uniform  upper  bound  for  the  e^'s,  ||  ||  £ e,  and  II  Bj,||  < 1,  then 


we 


obtain  J!  Sj||  5 (N  - j + 1)  e>  ||  J]  * d - ||  j+1)  e/ d - \^\\>  < •/  (1-  |)bn|J) 


3. A. 4.  Special  Techniques 

In  some  cases  the  matrix  A = (a^b^c^)  Possesses  special  properties 
that  may  be  important  to  preserve  in  order  to  satisfy  a strong  stability 
criterion.  For  example,  we  might  have  b 


j 


t . - a , - c . where 
J j J 


t j ||  is  small 


or  zero;  this  requires  that  all  blocks  be  n X n.  Scalar  tridiagonal  systems 


J 


r 
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with  this  property  arise  from  the  discretization  of  second  order  ODE  boundary 
value  problems.  Babuska  [Bl-3]  has  discussed  a combined  LU  - UL  factoriza- 
tion for  such  a matrix  which  allows  a very  effective  backward  error  analysis 
to  the  ODE  and  a stable  numerical  solution  of  a difficult  singular  pertur- 
bation problem  posed  by  Dorr  [ D3  ] . (See  [H7]  for  a symbolic  approach  to 
this  problem.) 

There  are  two  essential  features  to  Babuska' s method.  The  first  is 
to  express  b^  and  d^  indirectly  by  starting  with  t ^ and  computing  the  se- 
quence s^  = t^,  Sj  = tj  - ]_•  By  this  we  have  d = s.  - c^.  The 

second  feature  is  to  augment  the  forward  elimination  with  a backward  elim- 
ination and  dispense  with  the  back  substitution  by  combining  the  results 
of  elimination.  Specifically,  suppose  A = (— p j , t ^ + p^.  + q^.-q^)^. 

Pj.q^.tj  £ 0,  p^  = q^  = 0,  A nonsingular.  The  algorithm  for  Ax  = v is 


V fl 


V SN 


V fN  = V 

-1 


Sj  = Cj  + Pj(Sj-l+  qj-l}  SM 

£j  ' vj  + pj<sj-i  + 

★ ★ _ \ ★ 
s,  = t + q .(s , , . + p^.)  s 


j 


*j 'j+l  rj+i  “j+i 


f.  ■ v + q . (s  . . + p.  . 

J j V J+l  KJ+1  J+l 


j 2 , . . . ,N, 


j - N- 


(s.  + s*  - tj)_1(fj  + f * - v.),  j = 1, . . . ,N. 


In  matrix  terms,  we  factor  A = A^  + AQ  + A^  = (I  + L) (D  + Ay)  ■ (I  + U ) (D  + A^) 
and  solve  (I  + L) f ■ v,  (I  + U ) f * v.  Adding  (D  + A^)x  = f,  (D  + A^)x  = f 

★ ★ "k 

and  -Aqx  “ -Aqx  we  obtain  (D  + D - Ap)x  = f + f - Ax  = f + f -v. 


J 
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Babuska's  a^^orithm  has  some  unusual  round-off  properties  (cf.  [M3]) 
and  underflow  is  a problem  though  correctable  by  scaling.  It  handles  small 
row  sums  by  taking  account  of  dependent  error  distributions  in  the  data, 
and  thus  fits  into  Kahan's  notion  that  "conserving  confluence  curbs  ill- 
condition"  [Kl]. 

To  see  how  the  method  fits  in  with  our  results,  we  note  that,  for  n = 1, 

since  p.,q.,t  i 0 we  have  s £ 0,  s = t.  + p.d'^s.  , £ t..  Thus  the  row 
J J J 1 J J J J-l  J-l  J 

sums  can  only  increase  during  the  factorization.  On  the  other  hand,  they 

will  not  increase  much  since  s.“t.+p.(s,,+q,  ,)^s  . £ t.  + p.. 

J J J J-l  J-l  j-l  J KJ 

These  are  analogues  of  the  results  in  Theorem  3.2. 


3 . B.  Gauss-Jordan  Elimination 

An  algorithm  closely  related  to  block  elimination  is  the  block  Gauss- 
Jordan  elimination 

Aj  = A, 

Ai+i  - (I  + C[A.1E.)A.,  i = 1, . . . ,N. 


This  avoids  back  substitution  in  the  solution  of  Ax  = v since  A^_^  is  block 
diagonal.  The  rest  of  the  Gauss-Jordan  process  is 


-f(D 


f (1+1)  - (i  + 


C[A.  ]E.)fa) 


1 ) • • • 


Theorem  3.7.  Let  B.  - B[A,].  If  ||  B^l  < 1 then  ||A.+1||  £ ||  A.  ||  and 

II  ®i+ill s II  II for  L = 
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Sketch  of  Proof.  The  proof  parallels  that  of  Theorem  3.2.  Note  = B^; 
suppose  ||  B^  ||  < 1.  Let  F^  = L[A^]  + UfA^],  so  that  A^+^  = (I  - F.E.D[A  ]”*)A.. 
= A.  - F.E.  + F.E.B..  This  will  show  the  first  part.  Now  define 

i ii  ill. 

Q = D[Ai]‘1F.E.D[A.]'1Ai,  so  that  A.+1  = A.  - D[A.]Q,  D[A.+1]  = D[A.](I  - S)  , 

S = D[Q ] . We  have  Q = (-  B.E.)(I  - B.)  = - B.E.  + B.E.B.,  so  S = D[B.E.B.] 

1 i l ii  ill  ill 

and  ||  S ||  £ ||  B.J|  < 1.  Also,  (I  - S)B.+1  =■  B.  - S + Q.  Let  T = L - S + Q, 

J = S + T,  so  that  | J | = j S | + | T | , and  Lemma  2 . 1 applies  if  ||  J j|  < 1.  But 
J = B^(I  - E^  + E^B^)  , so  ||  J 1 1 ^ ||  B ^ 1 1 . From  Lemma  2.1  we  conclude 


QED 


When  applied  to  block  tridiagonal  matrices  the  block  Gauss-Jordan  algo- 
rithm requires  many  more  arithmetic  operations  than  the  block  LU  factoriza- 
tion, and  cannot  be  recommended  on  this  basis.  However,  it  has  some  numer- 
ical properties  that  foreshadow  our  analysis  of  the  more  important  odd-even 
reduction  algorithm. 

The  Gauss-Jordan  elimination  on  block  tridiagonals  may  be  expressed  in 
the  following  way: 


for  j = 1, . . . ,N-  1 


r : 1 


We  observe  that,  for  1 S i <j+k  SN,  -d~  1e  . , . =•  u .u u , , , , 

J J,j+k  J J+j.  j+k-T 

which  is  the  (j,j  4-  k)  block  of  B^.  Thus  we  are  able  to  refine  Theorem  3.7 
by  giving  closer  bounds  on  the  first  i-1  block  rows  of  B.. 

l 

Corollary  3.5.  If  ||b^||  <1  and  (point)  row  L is  contained  in  block  row 
j with  1 ^ j S i-1,  then 


If  (point)  row  l is  contained  in  block  row  j with  i £ j £ N,  then 

VV  = W * II  Bill  • 

This  is  our  first  result  on  the  convergence  to  zero  of  portions  of  a 
matrix  during  the  reduction  to  a block  diagonal  matrix.  It  also  provides 
a rather  different  situation  from  Gauss-Jordan  elimination  with  partial 
pivoting  applied  to  an  arbitrary  matrix  [PI].  In  that  case  the  multipliers 
above  the  main  diagonal  can  be  quite  large,  while  we  have  seen  for  block 


36 


tridiagonal  matrices  with  j|  B^||  < 1 that  the  multipliers  actually  decrease 
in  size. 

3 .C . Parallel  Computation 

We  close  this  section  with  some  remarks  on  the  applications  and  inher- 
ent parallelism  of  the  methods;  see  also  [H3].  When  a small  number  of 
parallel  processors  are  available  the  block  and  band  LU  factorizations 
are  natural  approaches.  This  is  because  the  solution  (or  reduction  to  tri- 
angular form)  of  the  block  system  d^u.  = -c  . is  naturally  expressed  in 
terms  of  vector  operations  on  rows  of  the  augmented  matrix  (d^|-Cj),  and 
these  vector  operations  can  be  executed  efficiently  in  parallel.  However, 
we  often  cannot  make  efficient  use  of  a pipeline  processor  since  the  vec- 
tor lengths  might  not  be  long  enough  to  overcome  the  vector  startup  costs. 
Our  comparison  of  algorithms  for  a pipeline  processor  (Section  10. C)  will 
therefore  consider  the  LU  factorization  executed  in  scalar  mode.  We  also 
note  that  the  final  stage  of  Babuska's  LU-UL  factorization  is  the  solution 
of  a block  diagonal  system,  which  is  entirely  parallel. 

On  the  other  hand,  the  storage  problems  created  by  fill-in  with  these 
methods  when  the  blocks  are  originally  sparse  are  well-known.  For  large 
systems  derived  from  elliptic  equations  this  is  often  enough  to  disqualify 
general  elimination  methods  as  competitive  algorithms.  The  block  elimina- 
tion methods  will  be  useful  when  the  block  sizes  are  small  enough  so  that 
they  can  be  represented  directly  as  full  matrices,  or  when  a stable  implicit 
representation  can  be  used  instead. 
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4.  QUADRATIC  METHODS 


In  Section  3 we  have  outlined  some  properties  of  the  block  LU  factor- 
ization and  block  Gauss-Jordan  elimination.  The  properties  are  character- 
ized by  the  generation  of  a sequence  of  matrices  >L  such  that 
||  B[Mi+^]||  ^ ||  B[1L]||.  From  this  inequality  we  adopted  the  name  "linear 
methods".  Now  we  consider  some  quadratic  methods,  which  are  characterized 
by  the  inequality  ||  B[Mi+1]||  £ ||  B[Mi ]2  || . 

There  are  two  basic  quadratic  methods,  odd-even  elimination  and  odd- 
even  reduction.  The  latter  usually  goes  under  the  name  cyclic  reduction, 
and  can  be  regarded  as  a compact  version  of  the  former.  We  will  show  that 
odd-even  elimination  is  a variation  of  the  quadratic  Newton  iteration  for 
A \ while  odd-even  reduction  is  a special  case  of  the  more  general  cyclic 
reduction  technique  of  Hageman  and  Varga  [HI]  and  is  equivalent  to  block 
elimination  on  a permuted  system.  Both  methods  are  ideally  suited  for  use 
on  a parallel  computer,  as  many  of  the  quantities  involved  may  be  computed 
independently  of  the  others  [H3].  Semidirect  methods  based  on  the  quadratic 
properties  are  discussed  in  Section  7. 


4 .A.  Odd-Even  Elimination 

We  first  consider  odd-even  elimination.  Pick  three  consecutive  block 
equations  from  the  system  Ax  = v,  A = (a^.b^jCj): 


V1V2  + bk-iVi  + ck-ixk  = Vi 
Vk-i  + Vk  + Vfen  “ \ 

ak+lXk  + hk+l^+l  + Ck+lXk+2  = vk+l 


If  we  multiply  equation  k- 1 by 


-\b 


-1 

k-  1 ’ 


equation  k+1  by 


■Ckbkil*  and  3dd» 


the 


result  is 
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(4.2) 


("akbk-lak-l)xk-2 

+ (bk  ' Vk-Vk-l  - Ckbk!iak+l)xk 

+ (-Ckbk+lCk+l)xk+2 

= (vk  " - CkClVl)- 


For  k = 1 or  N there  are  only  two  equations  involved,  and  the  necessary 
modifications  should  be  obvious.  The  name  odd-even  comes  from  the  fact 
that  if  k is  even  then  the  new  equation  has  eliminated  the  odd  unknowns 
and  left  the  even  ones.  This  is  not  the  only  possible  row  operation  to 
achieve  this  effect,  but  others  differ  only  by  scaling  and  possible  use  of 
a matrix  commutativity  condition. 

By  collecting  the  new  equations  into  the  block  pentadiagonal  system 

(2) 

H^x  - v , it  is  seen  that  the  row  eliminations  have  preserved  the  fact 
that  the  matrix  has  only  three  non-zero  diagonals,  but  they  now  are  farther 
apart.  A similar  set  of  operations  may  again  be  applied,  combining  equa- 
tions k-2,  k and  k+2  of  to  produce  a new  equation  involving  the  unknowns 

f 3) 

x^_^>  x^  and  x^^ • These  new  equations  form  the  system  H^x  = vv  . 

The  transformations  from  one  system  to  the  next  may  be  succinctly  ex- 
pressed in  the  following  way.  Let  = A,  v^  = v,  m = flog,,  N*1  , and  define 

Hi+1  = (I  + CfV)Hi’ 

v(l+n  - (I  + C[H.])v(l),  i - 1,2,. ..,m. 

Block  diagrams  of  the  H sequence  for  N = 15  are  shown  in  Figure  4.1a.  The 

distance  between  the  three  non-zero  diagonals  doubles  at  each  stage  until 

the  block  diagonal  matrix  H , is  obtained;  the  solution  x is  then  obtained 

nH-l 

, - 1 (nH-1) 

by  x = H ,vv 
7 nH-1 
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We  note  that  the  same  elimination  principle  may  be  applied  to  matrices 
of  the  periodic  tridiagonal  form 


It  is  only  necessary  to  include  row  operations  that  "wrap  around"  the  matrix, 
such  as 

row(N)  - a b 1 row(N-l)  - c b 1 row(l). 

N N-  1 N I 

An  H sequence  for  N 3 16  is  diagrammed  in  Figure  4.1b,  and  is  constructed  by 
the  same  general  rule  as  before,  though  there  are  serious  and  interesting 
complications  if  N is  not  a power  of  2. 

Clearly,  we  can  define  an  H sequence  for  any  partitioned  matrix  A,  and 
only  have  to  establish  conditions  under  which  the  sequence  may  be  continued. 

In  fact,  the  sequence  ||b[H^]||  is  quadratic  when  ||  B[A]||  < 1. 

Theorem  4.1.  If  ||  B[A]  ||  < 1 then  ||  B[H1+1j||  * ||  B[H._]2  ||  and  ||  H.+1 1|  ^ ||  H . ||  . 

If  ||  C[A  ] 1 1 x * 1 then  ||  C[H1+1]  ||x  * ||  C^]2^  and  ||  * ||  hJj  . 

Proof.  We  have  1^  - A,  Hi+1  = (I  + = (21  - = 

2H.  - H1D[H1]_1H1  - Hi(2l  - D[Ht]“  1H£)  » H^I  + B[H.  ])  . Suppose  ||  B[H^]  ||  < 1, 

II  C[H^]||j<  These  assumptions  are  independent  of  one  another  and  will 
be  kept  separate  in  the  proof.  Since  Hi  * DfH^JCI  - = (I  - CfH^pDfH^], 

we  have  H1+1  a D[H.](I  - B[HJ2)  - (I  - C[H1]2)D[Hi].  Setting  S - D[B[Hi]2  ] , 
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T - D[C[Hi  ]2  ] , we  obtain  D[Hi+1]  3 D[Hi](I  - S)  3 (I  - T)D[Hi] . Since 

||  Sj|  s ||  B[  HL  ]2  ||  < 1,  ||  T ||  x * ||  CCHi]2  ||  x < 1,  and  is  invertible, 

it  follows  that  is  invertible.  We  have  B[  + ^ ] = I - D[  ] H^j3 

I - (I  - S)'1D[Hif1D[H.](I  - B[H.]2),  so  B[H._]2  3 ( I - S)B[H.+1]  + S; 

2 

similarly  CfH^]  ■ C[Hi+1](I  - T)  4-  T.  Now,  S is  block  diagonal  and  the 

block  diagonal  part  of  B[H.+^]  is  null,  so  |B[H^]  | = |(I  - S)B[Hi+^]j  + |s|. 

2 

Lemma  2.2  now  applies,  and  we  conclude  ||  B[H^+^]||  ^ ||  B[H^]  ||.  Lemma  2.4 

implies  that  ||  C[H.+1]  1^  * ||  CtH.]2^.  To  show  ||  H.+l||  * ||  H.||  if 

||B(H.]ff<  1,  write  H.  + 1 3 D[H.](I-  D[  H . ]"  ^DfH.  ] - HJBCH^)  3 D[H±J 

N 

+ (H.  - D[H. ])B[H. ].  We  then  have,  for  1 s 1 s n., 

i L 1 ~j=1  J 

Px(H1+l)  * ^(D[Ht])  + 3i(Hi  - D[H.])  ||  B[H.]|| 

£ 02(D[H.])  + C£(H.  - D[H.]) 

3 0i(H.) . 

To  show  ||  H^JIj  * ||  Hj^  if  ||  CC^JHj  < 1,  write  H.  + 1 3 (I  - C[H.1(D[H.]  - H. ) x 
D[H,.  fSorH.  ] =.  D[Hi]  + C[Hi](Hi  - D[H^])  . We  have 

Y2(Hi+l)  * VD[Hi])  + II  C"HilH  - °tHi^ 

5 Y£(D[H.  ])  + y£(H.  - D[Ht]) 

=*  v^CH^  . QED 

For  a block  tridiagonal  matrix  A we  have  BtH^^]  3 0 since  is 

block  diagonal;  thus  the  process  stops  at  this  point.  The  proof  of  Theorem 
4.1  implicitly  uses  the  convexity- like  formulae 

Hi+1  = Hi  B[Hi]  + D^Hi](I  ‘ 

3 C[Hl  Hl+  (1  - C[Ht])D[Hi]. 
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As  with  Theorems  3.2  and  3.3,  it  is  not  difficult  to  find  counterexamples 
to  show  that  Theorem  4.1  does  not  hold  for  arbitrary  matrix  norms.  In 
particular,  it  is  not  true  that  ||  B[A  ] |J2  < 1 implies  ||  B[H2  ] j(2  £ ||  B[HX]  |(2 
or  even  that  ||  B [ H2  ] 1 1 2 < 1. 

The  conclusions  of  Theorem  4.1  are  that  the  matrix  elements,  considered 
as  a whole  rather  than  individually,  do  not  increase  in  size,  and  that  the 
off-diagonal  blocks  decrease  in  size  relative  to  the  diagonal  blocks  if 
j|  B [A ] { j < 1.  The  rate  of  decrease  is  quadratic,  though  if  ||  B[A]||  is  very 
close  to  1 the  actual  decreases  may  be  quite  small  initially.  The  relative 
measure  quantifies  the  diagonal  dominance  of  H^,  and  shows  that  the  diagonal 
elements  will  always  be  relatively  larger  than  the  off-diagonals.  One  con- 
clusion that  is  not  reached  is  that  they  will  be  large  on  an  absolute  scale. 
Indeed,  consider  the  matrix  A = (-l,2+e,-l),  for  which  the  non-zero  diagonals 
of  are 


(-l,2+et,-l)  / l£:J(2+ek) 


S’  ek+l  = 4ek  + ek  ' 


For  small  e the  off-diagonal  elements  will  be  about  2^  ^ and  the  main  diagonal 
2 i. 

elements  about  2 in  magnitude.  While  not  a serious  decrease  in  size, 
periodic  rescaling  of  the  main  diagonal  to  1 would  perhaps  be  desirable. 

As  already  hinted  in  the  above  remarks,  a more  refined  estimate  of 
||  B[HX+X]||  may  be  made  when  dealing  with  tridiagonal  matrices  with  constant 
diagonals  [S2].  Let  A = (a,b,a) , so  that 


1 


43 


b-a2  b 


-a2/b 


b-2a2/b 


-a 


b-2a2/b 


-a2  b 


- a b 


2 |a2/b I / j b-2a^/b I 


-a 

b-2a2.'b 

0 


b2-a2/b  / 


It  follows  that  when  2|a|  < jbj,  ||  B[H?]| 

'(2  - ||B[H1]2||),  and  thus  |j|  B^T  ||  < |f  B[H2  ] ||  < ||b[H1]2||.  Note  that 
does  not  have  constant  diagonals,  nor  will  any  of  the  other  H matrices, 
but  most  of  the  rows  of  H,  are  essentially  identical,  and  it  is  these  rows 
which  determine  |j  B [ ] f J - This  property  will  be  important  for  the  odd-even 
reduction  algorithms. 

When  A.  = (a^,b  ,Cj)  is  block,  diagonally  dominant,  we  can  write 
Hi  ' (aJi),0,...,0,bj(i),0,...,0,cja)),  S<D  = !|(b^i))-1||(|^i)||+  |Mji>  il  ) , 


,(i) 


,(i) 


maXj  j*  ' Without  proving  it  here,  we  can  show  that  if  3'‘i'  <1  then 


3 ( i"*"  1 ) s 3^2.  An  irreducibility  condition  plus  0^  £ 1 yields  the  same 
conclusion. 


We  also  note  that  if  A is  symmetric  then  so  is  IL,  i a 1,  and  thus 


B[Hi] 


Theorem  4.2.  If  A is  block  tridiagonal  and  positive  definite  then  so  is  H., 
i i 1,  and  P(B[H.])  < 1. 


Proof.  First  we  observe  that  each  is  a 2-cyclic  matrix  [V3]  since  it  can 
be  permuted  into  block  tridiagonal  form.  Suppose  that  is  positive  definite. 
Since  * D[H^]  is  also  positive  definite,  there  exists  a positive  definite 


matrix  such  that 


E2t.  ut  so  ft1+1-  2a.  - a2; 
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IL  and  Hi+1  are  positive  definite  iff  Hi  and  Hi+1  are  positive  definite. 

Since  is  a positive  definite  2-cyclic  matrix,  we  know  from  Corollary  2 
of  Theorem  3.6  and  Corollary  1 of  Theorem  4.3  of  [V3]  that  P (B^) < 1,  5^=  I - 1L 
Since  = I - D^IL  = E^B.EL  , f°lf°ws  that  p (B^)  < 1 also.  Let  X be 
an  eigenvalue  of  H^.  Then  -1  < 1-X  < 1 since  PClL)  < 1,  and  so  we  have 
0 < X < 2 and  0<2X  - X^  Si.  But  2X  - X^  is  an  eigenvalue  of  = 2H^-H^, 

which  is  therefore  positive  definite.  QED 

Any  quadratic  convergence  result  such  as  Theorem  4.1  calls  to  mind  the 
Newton  iteration.  If  ‘Ji(Y,Z)  = Y(2I  - ZY)  = (2 1 - YZ)Y,  Y(0)  is  any  matrix 
with  ||  I - ZY  ^||a<  1 for  some  norm  ||  • ||a,  and  Y^i+^  = 3l(Y^  ,Z)  , then 
Y^  converges  to  Z 1 if  it  exists  and  (I  - ZY^i+^)  = (I  - ZY^)^[H8]. 

It  may  be  observed  that  = 3l(H^,D[H^]  ^)  , so  while  the  H sequence  is  not 

a Newton  sequence  there  is  a close  relation  between  the  two.  Nor  is  it 
entirely  unexpected  that  Newton's  method  should  appear,  cf.  Yun  [Yl]  and 
Kung  [K2]  for  other  occurrences  of  Newton's  method  in  algebraic  computations. 
Kung  and  Traub  [K3]  give  a powerful  unification  of  such  techniques. 

The  Newton  iteration  Y^+^  * 2t(Y^  ,Z)  is  effective  because,  in  a 
certain  sense,  it  takes  aim  on  the  matrix  Z ^ and  always  proceeds  towards  it. 

We  have  seen  that  odd-even  elimination  applied  to  a block  tridiagonal  matrix 
eventually  yields  a block  diagonal  matrix.  The  general  iteration 
H.^  3 S incorporates  this  idea  by  aiming  the  i—  step  towards 

the  block  diagonal  matrix  DfH^].  Changing  the  goal  of  the  iteration  at  each 
step  might  be  thought  to  destroy  quadratic  convergence,  but  we  have  seen 
that  it  still  holds  in  the  sequence  ||  B [ ] 1 1 - 


4.B.  Variants  of  Odd- Even  Elimination 


There  are  a number  of  variations  of  odd-even  elimination,  mostly  dis- 
cussed originally  in  terms  of  the  odd-even  reduction.  For  constant  block 
diagonals  A = (a,b,a)  with  ab  = ba,  Hockney  [H5]  multiplies  equation  k- 1 
by  -a,  equation  k by  b,  equation  k+1  by  -a,  and  adds  to  obtain 

('a2)\-2  + (b2  - 2a2)xk  + ('a2)xk+2  = (bvk  - avk-l  ' avk+l}- 

This  formulation  is  prone  to  numerical  instability  when  applied  to  the  dis- 
crete form  of  the  Poisson  equation,  and  a stabilization  has  been  given  by 
Buneman  (see  [B20]  and  the  discussion  below). 

With  Hockney's  use  of  Fourier  analysis  the  block  tridiagonal  system 
Ax  * v is  transformed  into  a collection  of  tridiagonal  systems  of  the  form 
(- 1 , \,- 1) (sj)  = (tj)>  where  \ ^ 2 and  often  X » 2.  It  was  Hockney's  obser- 
vation that  the  reduction,  which  generates  a sequence  of  Toeplitz  tridiagonal 
matrices  (-1,X^,-1),  could  be  stopped  when  l/X^  fell  below  the  machine  pre- 
cision. Then  the  tridiagonal  system  was  essentially  diagonal  and  could  be 
solved  as  such  without  damage  to  the  solution  of  the  Poisson  equation. 

Stone  [S2],  in  recommending  a variation  of  cyclic  reduction  for  the 
parallel  solution  of  general  tridiagonal  systems,  proved  quadratic  conver- 
gence of  the  ratios  l/X^  and  thus  of  the  B matrices  for  the  constant 
diagonal  case.  Jordan  [Jl],  independently  of  our  own  work  [H2],  also  extend- 
ed Stone's  result  to  general  tridiagonals  and  the  odd-even  elimination  algo- 
rithm. The  results  given  here  apply  to  general  block  systems,  and  are  not 
restricted  solely  to  the  block  tridiagonal  case,  though  this  will  be  the 
major  application. 
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We  now  show  that  quadratic  convergence  results  hold  for  other  formu- 
lations of  the  odd-even  elimination;  we  will  see  later  how  these  results 
carry  over  to  odd-even  reduction.  In  order  to  subsume  several  algorithms 
into  one,  consider  the  sequences  = A,  v^  = v, 

»i+i  • Fi<2°rsi)  - 

v(l+1>  = F . (2D[H. 1 - 

1 L 1J  11  ’ 


where  F^,  G^  are  nonsingular  block  diagonal  matrices.  The  sequence  H cor- 
responds to  the  choices  F.  = I,  G.  = DfH.l  . 

i l • i 

In  the  Hockney- Buneman  algorithm  for  Poisson's  equation  with  Dirichlet 
boundary  conditions  on  a rectangle  we  have 


(a,b,a) , ab  = ba,  a 

(i)>, 


a,  b 


b, 


F = diag  (bv“')»  G 

ba)  - b,  b(i+»  1 

.<»  - a,  a<i+1)  = 


DCH.f1, 


b<i)2  - 2a(i)2, 


-a 


(i)  2 


Sweet  [Sll]  also  uses  these  choices  to  develop  an  odd-even  reduction  algo- 
rithm. We  again  note  that  the  matrices  lose  the  constant  diagonal  prop- 
erty for  i 2 2,  but  symmetry  is  preserved.  Each  block  of  H.  may  be  repre- 
sented as  a bivariate  rational  function  of  a and  b;  the  scaling  by  en- 
sures that  the  blocks  of  block  rows  j2i  of  are  actually  polynomials  in 

a and  b.  In  fact,  the  diagonal  blocks  of  these  rows  will  be  b^i+^  and  the 
non-zero  off-diagonal  blocks  will  be  a^i+^. 

Swarztrauber  [S6]  describes  a Buneman-style  algorithm  for  separable 
elliptic  equations  on  a rectangle.  This  leads  to  A = (a^.b^Cj)  where  the 

blocks  are  all  n x n,  and  a.  = ry.I  , b.  = T + 3.1  , c.  = y.I  . In  this 

J J»  J > J J n 

case  the  blocks  commute  with  each  other  and  may  be  expressed  as  polynomials 
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in  the  tridiagonal  matrix  T.  Swarztrauber' s odd-even  elimination  essenti- 
ally uses  = diag(LCM(b i- 1 > bj+2i-l^  » = Tn  for  j s 0 or 

j 2 S + 1,  G.  = D[Hl ]” 1 


Stone  [S2]  considers  a similar  algorithm  for  tri- 
diagonal matrices  (n  = 1)  but  with  = diagfb^^i-l  bj+2i-l^’  Gi  = 

Note  that  symmetry  is  not  preserved. 


In  all  three  cases  it  may  be  shown  by  induction  that  H 


i+1 


F F.H.  , 

l 1 l+l 


and  v(i+1)  = F^...F^v^+^\  These  expressions  signal  the  instability  of 
the  basic  method  when  F^  ^ I [B20].  However,  in  Section  2 .A  we  showed  that 
the  B matrices  are  invariant  under  premultiplication  by  block  diagonal 


matrices,  so  that  BfH^]  * B[H^]. 


(i) 


The  Buneman  modifications  for  stability  represent  the  vv  sequence  by 

sequences  p^  , q^  with  v^  = + q^.  Defining  = D[H.  ], 

Q.  = D.  - H.,  S.  = D[Q.G.Q.]  and  assuming  D.G.Q.  = Q.G.D.,  we  have 

xi  i li  LXi  ixiJ  & i ixi  xi  i i’ 


,(»  - 

(i+1) 


0,  q 


(1) 


.CD 


v, 


p(l)  + D^1(Q.p(i)  + q(i)) 
q(l+1)  = F.(Q.G.q(l)  +S,p(l+1)). 


Theorem  4.3.  Let  P = I,  P.  = B[H._n ] . . ] , i 

o(I)  ■ <VPi  • V*- 


Then  p 


(i) 


(I  - P^x, 


Remark.  Although  not  expressed  in  this  form,  Buzbee  et  al.  [B20]  prove  spe- 
cial cases  of  these  formulae  for  the  odd-even  reduction  applied  to  Poisson's 
equation. 

Proof.  To  begin  the  induction,  * 0 = (I  - P^)x,  q*'^  = v = H^x 

3 (®1^1  " Suppose  p^  = (I  - E\)x,  q ^ * (D^P^  - Q^x.  Then 

■Ar 

LCM  a least  common  multiple  of  the  matrices  considered  as  polynomials  in  T. 


43 


p(l+1)  = (I  - Px)x  + - P.)x  + (D.P.  - Ql)x) 

= (I  - P.  + 5T  1Qi  - D^Q.P.  + D^D.P.  - D^1Q.)x 


= (I  - B[H.]P.)x 


= (I  - Pi+1>* 


, ( i+1)  _ - (i+1) 


= Hi+lX  - W1  - Pi+l)X 

(P>i+l  " ^i+1  ' °i+l  + °i+lPi+l)x 
= <5i+lPi+l  " WX‘ 


We  note  that  when  = DfH^]  1 and  S = ||  B [ A ] 1 1 < 1 then  ||P.J|  = 
and  Pk(D.P.  - Q.)  * Pk(D . ) ||  P . ||  + Pk(Q.)  * ^(D.)  + ok(Q.)  = Pk(H.)  , so 
||  D^I\  - Q^JI  £ ||  ||  . It  also  follows  that  ||  x - ||  ^ ||  K||  ||  x||  and 

||  q^"1  + Q^xJI  £ ||  ||  ||  x||  . Thus  p^  will  be  an  approximation  to  x; 

Buzbee  [B17]  has  taken  advantage  of  this  fact  in  a semidirect  method  for 
Poisson's  equation. 


/"'-I 


4 .C.  Odd- Even  Reduction 
4.C.I.  The  Basic  Algorithm 

The  groundwork  for  the  odd-even  reduction  algorithms  has  now  been  laid. 
Returning  to  the  original  derivation  of  odd-even  elimination  in  (4.1)  and 
(4.2),  suppose  we  collect  the  even-numbered  equations  of  (4.2)  into  a 
linear  system  A*'2^2'*  = w^,  where 

A<  ' ' ('S2jb2 j-ia2j-l,b2 j"  a2Jb2J-lc2j-l  * °2 jb2 j+l*2J+l’"C2 j^j+l^J+l’iy 

*«>  - <*2)V,  ' 
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W<  } " (V2j  " a2jb2j-LV2j-l  " C2jb2j+lV2j+l)N2  = 

N2  = LN/2J. 

(2) 

This  is  the  reduction  step,  and  it  is  clearly  a reduction  since  A is  half 

(1)  (2)  (2)  (2)  (2) 

the  size  of  A =*  A.  Once  A x = w has  been  solved  for  x we  can 

compute  the  remaining  components  of  x^  * x by  the  back  substitution 
X2j-1  = b2j-l(v2j-l  ' a2j-lx2j-2  " C2j-lX2j)’  j = N/2'’ 


Since  is  block  tridiagonal  we  can  apply  this  procedure  recursively 

to  obtain  a sequence  of  systems  A^x*'^  = , where  A^  = A,  w^  = v, 


x(L)  = (xj2i-l>Ni»  - N,  N.+1  = LN./2J.  We  write  Avi;  = (a^;  ,c > 

x^  = x . 2 l-  1 • The  reduction  stops  with  A^x^  = w^m\  m = log^  \j+l” 

(note  the  change  in  notation)  since  A^  consists  of  a single  block.  At 

this  point  the  back  substitution  begins,  culminating  in  x*"^  = x. 

m rn+ 1“  i 

Before  proceeding,  we  note  that  if  N = 2 -1  then  = 2 - 1.  The 

odd-even  reduction  algorithms  have  usually  been  restricted  to  this  particular 
choice  of  N,  but  in  the  general  case  this  is  not  necessary.  However,  for 
A = (a,b,a)^  there  are  a number  of  important  simplifications  that  can  be 
made  if  N = 2m - 1 [B20].  Typically  these  involve  the  special  scalings  men- 
tioned earlier,  which  effectively  avoid  the  inversion  of  the  diagonal  blocks 
during  the  reduction. 

It  is  also  not  necessary  to  continue  the  reduction  recursively  if 
(k)  (k)  (k) 

A x = wv  can  be  solved  more  efficiently  by  some  other  method.  This 

leads  to  various  polyalgorithms  and  the  semidirect  methods  of  Section  7. 

By  the  way  in  which  A^  has  been  constructed,  it  is  clear  that  it 

(4) 

consists  of  blocks  taken  from  H^.  For  example,  in  Figure  4.1a,  A is 


(*>  = /-(T)  u(i)  „(*)■ 
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the  (8,8)  block  of  . It  follows  immediately  that  ||  A 


* H . 


B[Hi]  ||,  and  ||  C[A(i)  ] * ||  C[H(l)  ] ||  . As  a corollary 


of  Theorem  4 . 1 we  actually  have 


Corollary  4.1.  If  ||b[A]||<1  then  ||  B[A( L+1)  ] ||  ^ ||  B[A (l)  ]2  ||  and 
||A(l+l)||i  If  ||C[A]||1<  1 then  ||  C[A(  I+1> ) ||j  s ||  C(A(1)  ]2  ||j 

anh  ||A<1+1)||lS  Ha'0^. 

1 1.1 j _ r • ! j £ -m *■»  / i _ a.  A ✓ i \ _ / i I "*  1 _ II  1 1 “ 1 


and  A 


Recalling  the  definition  of  Theorem  3.4,  let  X(A)  = maxj(4||bj  a_.  ||  ||b_. 


Theorem  4 .4 . If  X(A)  £ 1 then,  with  X^  = X(A^),  we  have 


t(l+1)  <;  (X(l)/(2  - \(l)))2. 


Proof.  It  will  suffice  to  consider  i = 1.  We  have 


CT2j  " b2ja2jb2j-lC2j-l  + b2jC2jb2j+ia2j+l 

(2)  ,-l 

3j  = -a2jb2j-la2j-l’ 

c(2)  = -c  b'1  c 
Cj  C2j  2j+lC2j+l’ 

II  CT2  j H ^ II  b2  ja2  j II  II  b2 J-  1C2  j-  1 II  + II  b2j+ia2j+l 

!Rl“a2 j)~ 1 II  * (1  - V2)_1  = 2/(2  - X), 

, I,  (2)-l  (2)  ||  ,|  (2)-l  (2)  || 

4Hbj  aj  llllbj-l  Cj-lH 

^ ^ lid  " o>2J)“1||  ||(1  - ^2j.2)_1||  (II  ^2  ja2  j 

(II  b2  j-  Ia2  j-  1 II  II  b2  j-2C2  j-2  I'  ) 

s 4(2(2  - X))2(V4)2  - ( V (2  - X))2. 


b2jC2jH  * 2(X/4) 


b2  j-  1C2 j-  1 II  } 


Thus  X(2)  s (X(1)/(2  - X(1)))2. 
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This  may  be  compared  to  earlier  results  for  constant  diagonal  matrices, 

(2)  2 2 

which  gives  an  equality  of  the  form  3V  = a / (2  - 3 ) . See  also  Theorem 
4.8  of  [ V3  ] . 

4.C.2.  Relation  to  Block  Elimination 

By  regarding  odd-even  reduction  as  a compact  form  of  odd-even  elimina- 
tion we  are  able  to  relate  it  to  the  Newton  iteration  for  a"*.  It  is  also 

possible  to  express  the  reduction  algorithm  as  block  elimination  on  a per- 
T 

muted  system  (PAP  ) (Px)  = (Pv)  (cf.  [311],  [E3],  [W2]).  In  Section  5 we 

show  how  this  formulation  also  yields  quadratic  convergence  theorems. 

Specifically,  let  P be  the  permutation  matrix  which  reorders  the  vector 
T 0 

( 1 ,2 , 3 , . . . ,N)  so  that  the  odd  multiples  of  2 come  first,  followed  oy  the 
1 2 

odd  multiples  of  2 , the  odd  multiples  of  2“,  etc.  For  N = 7, 

P(l,2, . . . ,7)T  = (1,3,5,7,2,6,4)T,  and 


This  factorization  gives  rise  to  a related  method,  first  solving  L' z = Pv, 

then  U' (Px)  3 z.  It  is  easily  seen  that  z^  = 

We  may  now  apply  any  of  the  theorems  about  Gaussian  or  block  elimina- 
T 

tion  to  PAP  and  obtain  corresponding  results  about  odd-even  reduction  and 
the  above  modification.  In  particular,  Theorem  3.2  applies  though  its 

conclusions  are  weaker  than  those  of  Corollary  4.1.  It  follows  from  the 

T 7 (i) 

PAP  = LL  factorization  as  well  as  Theorem  4.2  that  A will  be  positive 

definite  if  A is;  Theorem  4.2  also  allows  the  conclusion  that  p(B[A^])  < 1. 


54 


The  odd-even  reduction  is  also  a special  case  of  the  more  general 

cyclic  reduction  defined  by  Hageman  and  Varga  [HI]  for  the  iterative  solu- 

T 

tion  of  large  sparse  systems.  Since  PAP  is  in  the  2-cyclic  form 


PAP 


and  D?  block  diagonal,  we  can  premultiply  by 

f 1 'A12D2 

V^l0'!1 

to  obtain 


A12D2  A21 


It  is  clear  that  A 


(2) 


°2  ' A21°l  A12> 


°2  " A21°l  A12  and  (D1  " A12°2  A21’ 


0)  are  the  rows 


of  ^ that  have  not  been  computed  as  part  of  the  reduction.  In  consequence 


of  this  formulation,  if  A is  an  M-matrix  then  so  A 
P(B[A(1+1)])  < p2(B[AU)])  < 1 [HI]. 


(i) 


and 


The  block  LU  factorization  PAP  = LAU,  where  A is  block  diagonal  and 
L(U)  is  unit  lower  (upper)  triangular,  determines  the  reduction  and  back 
substitution  phases.  Norm  bounds  on  L and  U were  given  in  Section  3. A. 2 
for  general  matrices  with  ||  B[A]||  < 1.  For  block  tridiagonal  matrices  in 
the  natural  ordering  we  have  ||  L 1 1 ^ £ 1 + ||  CfAjHj  , ||  L||  £ n||  C[A]||^  if 
||  G[A]  ||  x < 1,  ||U||  S 1 + ||  B[A]||,  ||  U||x  * 1 + n||  B[A]||  if  ||B[A]||<1, 

n = maXj  n ^ . For  the  permuted  block  tridiagonal  system  only  the  bounds  on 
||  L ||  and  1 1 U 1 1 ^ wi  1 1 grow,  and  then  only  logarithmically.  By  the  construc- 


tion of  L and  U, 
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||  L ||  s 1 + max  ||  C[A(l)]||  , 

15iSm  1 

INI  * 1 +_  II  C[A(1)  ] ||, 

i-1 

Hull  I*l+L  II  B[A(i>  ] ||  , 

i-1  1 

II  U II  s 1 + max  ||  B[A(l)  ] ||. 
l^i^n 

Since  C[A<'l)]  is  block  tridiagonal,  p^CTA^])  5 2n  ||  C[A  ^ ] || , so 
||  C[A('I')  ] ||  5 2n ||  C[A(l)  ]||  L;  similarly  ||  B[A(l)  ] ||  1 5 2n||  B[A(l)  ||  . If 
II  B[A]  ||  < 1,  ||  C[A ] ||  < 1,  we  then  have 


||  Lllj  * 1 + ||  CCAJHj 

||  L||  5 1 + 2nm||  C[A]||lf 

||  U||  x 5 1 + 2nm||  B[A]  || , 
||  U ||  ^1+  ||  N[A]  || . 


Much  like  the  proof  of  Corollary  3.4  we  also  have,  if  ||  B[A]|| 


> x < 


SO 


and 


lib 


(i+l) 

tf- 

(i+l) 

j 


b(i)n 

b2j  (1  a2j  ) 


II  B[A 


(i)]2 


b^}||  (1  + Si> , S.  = ||  B[A(i)]2 


II  (bji+1))  “1 1| 

...  (i+l). 
cond(bj  ) 


(b^V1!!/  (i  - tj. 


cond(b 


(i) 
2 j 


)(1  + S.)  / (1  - 3.). 
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Since  our  interest  is  in  bounding  cond(b^ji+^) 
must  simplify  the  above  expression.  Let  = II 
that 


in  terms  of  cond(b^) , we 
j=1(l  + 3 j ) / ( 1 - Bj),  so 


cond(b^1+1)) 


5 [max^  cond(bk)]Ki 


by  induction. 


Using  |3 


we  obtain 


Kis"1i-i(1  + 3j)/<1-  ^.i> 

= (i  + s.)/[(i  - spn^Jd  - s.)] 

< (1  + 3.)/(l  - S^1. 


This  is  a rather  unattractive  upper  bound  since  it  may  be  quite  large  if  3^ 
is  close  to  1.  Actually,  the  situation  is  not  so  bad,  since 

II  II  ^ Jill  on(^  -i^ll  _ 


2 j 


'll<fll  B[A  ] ||  = 2^i'  and 


cond(b^1+1))  < [maxj^  cond(bk)  ] j _ ^ 


2+3. 


If  X(A)  s 1 then  ||  <j<J}  ||  <;  X(i)/2,  \(l)  = X(A(l)),  and 


cond(b^L+1))  £ [maxk  cond(bk) ] 2 + X 


(i) 


2 - X 


(i) 


^ 3 ma3C[c  cond(bk> 


When  A is  positive  definite  it  follows  from  various  eigenvalue- interlacing 
theorems  that  cond(b*j^)  < cond(A<'^’^)  < cond(A). 

4.C.3.  Back  Substitution 

Now  let  us  consider  error  propagation  in  the  back  substitution.  Suppose 
yki+^  ■ x,<i+1)  + ?k^+^\  1 ^ k S defines  the  computed  solution  to 

. ( i+1)  (i+1)  „ (i+1) 

A x a w 


, and  we  compute 
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1 


Then 


*J-  1 


(b^  )'1(W(1)  - a'  ' y'~  - . c 

V 2j-l;  V 2j-l  2i-l  yi-1 

.(i) 


(i+1) 

2J-1  yj-l 


(i)  (i+1). 

2J-1  yj 


+ e 


2 j—  1 * 


y(«  , y(i+1> 

y2j  yj 


§(1) 

^2  j- 1 


rw 

' 2j-  1 


.(i) 

'2j-l 


(b^  )"1  a^  |(i+1)  _ (b^  c(i)  e(i+1^ 

V 2j-l;  a2j-l  *J-1  ^j-l*  C2j-1 


+ e 


(i) 

2 j-  1 * 


-(i)  , -(i+1) 
a2  j -j 


Now  define  the  vectors 


.(i) 


.(i) 


zl  " <ei  * ° ’ «3  » °’ 


.(i+1) 


(i+1) 


(0,  5 > o,  . 


We  have  |j  z^\\  - ||  *(i+^||  by  construction,  and 


?(l)  = B(l)z2  + zx  + z2,  B(l)  =■  B[A(l)]. 


But  in  consequence  of  the  construction  we  actually  have 


If  ||  B[A]  ||  <1  then  ||  B(1)z,||  £ ||  B 


§(L)  ||  ^ max(||  B(l)z2  + zjl  , ||  z2  ||  ) . 
(i) 


z2  II  < II  Z2II  • 


Suppose  ||  ?(m)  ||  £ e and  ||  e(l)||  * e for  all  i =*  l,...,m-l. 

II  ?(l)  II  s ™*<il  B(i)  ||  ||  z2||  + ||  z1||,  ||  z2  ||  ) 

* II  Z2II+  II  ZJI 


(i+Dn  I,  e(i) 


s (m  - (i+1)  + 1) e + e 
“ (m  - i + 1) e. 


Then 


(by  induction  on  i) 
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and  ||  ||  , the  error  in  the  computed  solution  of  Ax  = v,  is  bounded  by 

me,  m =>  rlog2  N + l"!  . This  is  certainly  a satisfactory  growth  rate  for 
error  propagation  in  the  back  substitution.  In  fact,  it  is  a somewhat 
pessimistic  upper  bound,  and  we  can  more  practically  expect  that  errors  in 
the  solution  of  x^+^  = w^+^  will  be  damped  out  quickly  owing  to 

the  quadratic  decreases  in  ||  ||  . 

4.C.4.  Storage  Requirements 

On  the  matter  of  fill-in  and  storage  requirements,  we  note  that  block 
elimination  in  the  natural  ordering  of  A,  as  discussed  in  Section  3. A,  re- 
quires no  additional  block  storage  since  there  is  no  block  fill-in.  However, 
it  has  already  been  seen  that  odd-even  reduction  does  generate  fill-in.  The 

amount  of  fill-in  is  the  number  of  off-diagonal  blocks  of  A ^ , ...,A^m\ 
m 

which  is  2(N.  - 1)  < 2N.  Since  the  matrix  A requires  3N  - 2 storage 

~i=2  L 

blocks,  this  is  not  an  excessive  increase.  Of  course,  we  have  not  cons  id  red 
the  internal  structure  of  the  blocks,  as  this  is  highly  dependent  on  the 
specific  system,  nor  have  we  considered  temporary  storage  for  intermediate 
compu tat ions . 

We  do  note,  however,  that  if  the  factorization  need  not  be  saved,  then 

some  economies  may  be  achieved.  For  example,  in  the  transition  from  A^ 

(2)  (2)  (2)  (2) 
to  A , the  first  equation  of  A x = w is  generated  from  the  first 

three  equations  of  A^x^  = w^  . The  first  and  third  of  these  must  be 

saved  for  the  back  substitution,  but  there  is  no  reason  to  save  the  second 

(2)  (2)  (2)  (2)  (2)  (2) 
once  Av  xv  = wv  is  formed.  Thus  the  first  equation  of  A^  xv  = wv 

may  overwrite  the  second  equation  of  A^x^  * w^  . Similar  considera- 
tions hold  throughout  the  reduction,  but  some  modifications  are  needed  on 
the  CDC  STAR-100  due  to  its  restrictions  on  vector  addressing  (see  Section  10. B). 
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The  special  scalings  of  odd-even  reduction  for  systems  derived  from 
separable  elliptic  PDE's  allow  implicit  representation  of  the  blocks  of 
as  polynomials  or  rational  functions  of  a single  matrix  variable. 

Since  the  blocks  of  A are  quite  sparse  this  is  necessary  in  order  to  limit 
intermediate  storage  to  a small  multiple  of  the  original  storage.  Unfor- 
tunately, there  does  not  seem  to  be  a similar  procedure  that  can  be  applied 
to  nonseparable  PDE's.  The  only  recourse  is  to  store  the  blocks  as  dense 
matrices,  and  on  current  computing  systems  this  will  limit  them  to  small 
sizes . 


4.C.5.  Special  Techniques 


Now  suppose  that  A = (-p  ,t  + p + q , -q  ) , p ,q  ,t  i 0,  p = q = 0; 

J J J J J J J J IN 

we  would  like  to  apply  Babuska's  tricks  for  the  LU  factorization  (Section 

3 .A. 4)  to  the  factorization  induced  by  odd-even  reduction.  This  only  par- 

(2) 


tially  succeeds.  First,  we  can  represent  A 
where 


(2)  (2)  (2)  (2)  (2) 

(-p . ,t.  +p.  + q . ,-q . ) 

j j j j j 


pj 


(2) 


.(2) 


.(2) 


qj 


P2 j b2j-l  P2 j— 1 * 

C2j  + P2 j b2j-l  *"2 j- 1 + q2j  b2j+l  C2j+1’ 
q2j  b2j+l  q2 j+1 ’ 

Ck  + Pk  + V 


(2) 

Note  that  t^^  5 £ b2j"  tbe  ot^er  han^»  the  technique  of  avoiding 

T 

the  back  substitution  by  combining  the  LU  and  UL  factorizations  of  PAP 

_T 


will  not  work.  Let  A'  = PAP 


(I  + L)  (D  + U)  = (I  + U*)  (D*  + L*)  , (I  + L)f  = Pv, 


JU  J.  ^ ^ 

(D  + U)  (Px)  = f,  (I  + U )f  53  Pv,  (D  + L ) (Px)  = f , so  that 
(D  + D*  - D[A'  ])  (Px)  =*  f + f*  - (L*  + D(A'  ] + U)  (Px)  . 
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The  only  way  to  have  A'  = L + D[A' ] + U is  if  no  fill-in  occurs  in 

ie 

either  factorization,  in  which  case  L = L[A' ],  U = U[A].  In  particular 
fill-in  occurs  in  odd-even  reduction,  so  the  method  does  not  work  in  its 
entirety.  However,  it  will  still  be  advantageous  to  do  the  reductions  by 
developing  the  t^  sequences.  (See  [SI]  for  some  examples.) 

SchrBder  and  Trottenberg  [SI]  also  consider  a method  of  "total  reduc- 
tion" for  matrices  drawn  from  PDE's  and  finite  difference  approximations 
on  a rectangular  grid.  Numerical  stability  and  further  developments  are 
discussed  in  [Sla],  The  total  reduction  is  similar  to  odd-even  reduction 
(which  is  derived  in  [SI]  by  a "partial  reduction"),  but  just  different 
enough  to  prevent  the  application  of  our  techniques.  Part  of  the  problem 
seems  to  be  that  total  reduction  is  more  closely  related  to  the  PDE  than 
to  the  matrix,  though  the  algorithm  is  expressible  in  matrix  form.  Various 
convergence  results  are  undoubtedly  obtainable,  but  we  have  not  yet  deter- 
mined the  proper  setting  for  their  analysis. 

4 .D.  Parallel  Computation 

Finally,  the  utility  of  the  odd-even  methods  for  parallel  or  pipeline 

computation  has  been  demonstrated  in  several  studies  [H3].  Stone  [ S2 ] , 

Lambiotte  and  Voigt  [L2],  and  Madsen  and  Rodrigue  [Ml]  discuss  odd-even 

★ 

reduction  for  tridiagonal  systems,  and  conclude  that  it  is  the  best  method 

to  use  when  N is  large  (>256).  Jordan  [Jl]  considers  odd-even  elimination, 

which  is  shown  in  [Ml]  to  be  most  useful  for  middle  ranges  of  N (64  < N < 256) 

on  a particular  pipeline  computer,  the  CDC  STAR-100.  For  small  values  of 

N the  sequential  LU  factorization  is  best,  owing  to  the  effect  of  lower  order 

terms  engendered  by  the  pipeline  mechanism. 

★ 

The  basic  method  given  here  is  a block  analogue  of  the  method  examined  in  [L2]. 
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The  inherent  parallelism  of  the  odd-even  methods  is  in  formation  of 

. . c , (i+1)  ( i+1)  ( i+1)  (i+1)  . 

the  equations  of  A x = wv  or  lL+^x  = vv  and  in  the  back 

substitution  for  odd-even  reduction.  It  is  readily  seen  from  (4.2)  that 

H0  and  v^  may  be  computed  from  and  v^  by 

compute  LU  factors  of  b^,  (1  £ k £ N) 
solve  bk[^  ck  vfc]  - [ak  ck  vk],  (1  * k £ N) 


- bk  ' Vk- 1 

- Vk+l’  (1 

s k £ N) 

<2> 

/V 

*"  Vk  ' ak\-l 

- Vk+i’  a 

£ k £ N) 

42) 

- -Vk-1’  (3 

s k £ N) 

<2) 

**  "CkCk+l  ’ (1 

^ k s N - 2) 

(Special  end  conditions  are  handled  by  storing  zeros  strategically  or  by 

(2)  (2) 

shortening  vector  lengths  slightly.  A ' and  w ' are  computed  by  only  gen- 
erating new  equations  for  even  k;  the  back  substitution  is  then  rather 

(2) 

straightforward  once  x is  computed. 

The  algorithm  for  odd-even  reduction,  N = 2m-l,  all  blocks  n X n, 

* - 1,  is  as  follows:  for  i = 1 m- 1 


compute  LU  factors  of  b^j^  , , (0  ^ k ^ N.  ,) 

Z k.+ 1 l+l 
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solve  b$m)  x|m)  = wjm) 
for  i = m- 1 , . . . , 1 
solve 

soive  D2k+1  x2k+1 


(w(i)  - a(i)  x,(i+1) 

' 2k+l  a2k+l  *k 


c 


(i)  (1+1) 

2k+l  k+1 


) , (0  £ k * Ni+1) 


The  ith  reduction  and  corresponding  back  substitution  uses 
2 3 2 3 

n + 0(n  ))N^+^  “ 12 n arithmetic  operations,  and  the  total  operation 

2 3 2 3 

count  for  the  algorithm  is  (12-jN  - 12m)n  + 0(n  N + n ) . This  may  be  com- 

2 3 2 3 

pared  to  4-jNn  + 0(n  N + n ) operations  for  solution  of  Ax  = v by  the  LU 
factorization . 

★ 

For  a crude  estimate  of  execution  tine  on  a pipeline  computer,  we 
take  tn  + a,  t « a,  as  the  execution  time  for  operations  on  vectors  of 
length  N,  and  t,  r « t « cr,  as  the  execution  time  for  scalar  operations 

[H3].  The  total  execution  time  for  odd-even  reduction  would  then  be 

232  3 2323 

roughly  (12— n + 0(n  ))(tN  + am)  + 0(n  t)  versus  (4—  n N + 0(n  N + n ))t 

for  the  LU  factorization  done  completely  in  scalar  mode.  A qualitative 
comparison  shows  results  similar  to  the  more  precise  analysis  given  to 
tridiagonal  systems:  for  a fixed  value  of  n,  odd-even  reduction  will  be- 
come increasingly  more  effective  than  the  LU  factorization  as  N increases. 
Odd-even  elimination  should  maintain  its  utility  for  moderate  values  of  N; 
the  factor  tN  + am  is  replaced  by  (tN  + <7)m,  but  data  manipulation  is  often 
much  simpler  than  in  odd-even  reduction.  The  actual  crossover  points  be- 
tween algorithms  are  implementation  dependent  and  will  be  considered  to 
a limited  extent  in  Section  10. C. 

A critical  observation  on  the  execution  time  of  odd-even  reduction  is 
that  roughly  50^  of  the  total  time  is  spent  in  the  first  reduction  from 
* 

A more  detailed  timing  analysis  is  in  Section  10. C and  Appendix  B. 


J 
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A^  = A to  . Special  efforts  to  optimize  the  algorithm  by  solving 
the  reduced  systems  differently  must  take  this  fact  into  account.  Further- 
more, in  timing  each  reduction  it  is  seen  that  roughly  60$  of  the  high- 
order  term  n^N  comes  from  the  matrix  multiplications  used  to  compute  A^i+^ 
and  30$  from  solving  the  2n+l  linear  systems  for  a„,  , etc.  Relatively 

4.  Kr  L 

little  effort  is  used  to  factor  the  b^'s  or  to  make  back  substitutions 

J 

having  saved  the  factorizations. 

The  use  of  a synchronous  parallel  computer  creates  some  constraints 
on  implementation  which  are  not  so  important  on  a sequential  computer. 

The  first  restriction  in  the  interests  of  efficiency  is  that  the  blocks  of 
A should  all  be  the  same  size.  This  allows  simultaneous  matrix  operations 
(factoring,  etc.)  to  be  programmed  without  resorting  to  numerous  special 
cases  due  to  varying  block  sizes.  In  addition,  pivoting  causes  ineffi- 
ciencies when  the  pivot  selections  differ  between  the  diagonal  blocks. 

As  with  the  LU  factorization,  the  assumption  j|  3[A]||  < 1 only  guarantees 
that  no  block  pivoting  is  needed.  If  A is  strictly  diagonally  dominant 
or  positive  definite,  then  no  pivoting  within  the  blocks  will  be  required. 

Thirdly,  the  choice  of  data  structures  to  represent  the  matrices  and 
vectors  must  conform  to  the  machine's  requirements  for  the  parallel  or 
vector  operations.  This  is  an  important  problem  on  the  CDC  Star  owing 
to  its  limitations  on  vector  addressing  and  allocation  [LI].  When  A is 
restricted  to  having  small  blocks  this  is  not  as  serious  as  it  would  be 
otherwise.  In  the  case  of  the  Illiac  IV,  the  problem  of  data  communica- 
tion between  processing  elements  must  also  be  considered. 
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4 .E.  Summary 

We  now  summarize  the  results  of  this  section.  The  odd-even  elimina- 
tion and  reduction  algorithms  may  be  safely  applied  to  the  solution  of 
certain  block  tridiagonal  linear  systems,  and  in  some  cases  various  impor- 
tant quantities  involved  in  the  solution  decrease  quadratical ly  to  zero. 
The  quadratic  convergence  may  be  explained  by  expressing  the  methods  as 
Newton-like  iterations  to  compute  a block  diagonal  matrix.  Variations  of 
the  basic  methods,  which  have  important  applications  to  PDE's,  also  dis- 
play quadratic  convergence.  Parallel  computation  is  a natural  format  for 
these  methods,  although  practicalities  place  some  restrictions  on  the 
type  of  system  that  can  be  solved  efficiently. 
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5.  UNIFICATION:  ITERATION  AND  FILL-IN 

Having  presented  several  linear  and  quadratic  methods,  we  now  treat 
them  in  a unified  manner  as  particular  instances  of  a general  nonstationary 
iteration.  We  are  also  lead  to  an  interpretation  of  quadratic  convergence 
in  terms  of  matrix  fill-in,  and  conclude  that  the  odd-even  methods  are  the 
only  instances  of  the  general  iteration  which  display  quadratic  convergence. 
These  results  apply  to  arbitrary  matrices  satisfying  |j  B[A]||<  1 and  not 
only  to  the  block  tridiagonal  case. 

Let  us  first  collect  the  formulae  for  the  block  LU  factorization, 
block  Gauss- Jordan,  and  odd-even  elimination. 

LU:  A.+  1 = (I  - L[Ai]EiD[Ai]“1)A1,  Aj  = A. 

GJ:  A.+1  = (I  - (L[.\.]  + UrA.])E.D[Ai]‘1)A.,  Aj  = A. 

OE:  Hi+1  = (I  - (L[H.]  + ^H.  , ^ = A. 

Now  consider  the  function  F(X,Y,Z)  = (2 1 - XY)Z,  and  observe  that 

Ai+1  = F(D^Ai^  + UAi]Ei,  DCA.f1,  A.) 

Ai+1  - F(D[A.]  + (L[At]  + U[Ai])Ei,  DfAj'1,  A.) 

Hi+1  = F(Hi’  D^Hi^1»  Ht) 

Thus  it  is  necessary  to  analyze  sequences  based  on  repeated  applications 
of  F. 

For  fixed  nonsingular  Y we  can  define  the  residual  matrices 
r (M)  = I - MY,  s(M)  * I - YM,  so  F(X,Y,Z)  = (I  + r(X))Z  and 

r (F (X,Y,Z) ) - r(Z)  - r(X)  + r(X)r(Z), 
s(F(X,Y,Z))  = S (Z)  - s(X)  + s (X) s (Z) . 


66 


By  taking  X = Z and  assuming  that  ||  r(Z1)  j|  < 1 or  ||  s(Zj)  Jj  < 1,  the  sta- 
tionary iteration  3 FlZ^.Y.Z^)  converges  quadratical ly  to  Y This 

is,  of  course,  the  Newton  iteration. 

The  related  operator  G(X,Y,Z)  = Z + X(I  - YZ)  3 Z + Xs  (Z)  has  pre- 
viously been  studied  as  a means  of  solving  linear  systems  [H8].  When 
Zl  3 X,  Zi+^  3 G(X,Y,Z^),  Bauer  [B8]  observes  that  r(jL  + ^)  3 r(X)r(Zi>, 
s(Z1+1)  3 s(X)s(Zi),  Z.  = (I  - rCXiSY"1  3 Y_1(I  - s(X)S,  Zi+R  = G^.Y.L), 
and  in  particular  Z9^  = G(Z^,Y,Z^).  This  last  formula  also  leads  to  the 
Newton  iteration,  for  G(Z,Y,Z)  * F(Z,Y,Z),  s(Z9i>  = s(Z.)2. 

In  this  section  we  study  the  nonstationary  iteration  Z ^ - A, 

Z.,.  = F(X.,Y.,Z.),  where  Y.  = D[Z,]  ^ and  X.  consists  of  blocks  taken 

i+l  ill  i L iJ  i 

from  Z^,  including  the  diagonal.  The  selection  of  blocks  for  X^  is  al- 
lowed to  change  from  step  to  step,  but  we  shall  assume  that  the  selection 

sequence  is  predetermined  and  will  not  be  altered  adaptively.  (The  cost 

of  such  an  adaptive  procedure,  even  on  a highly  parallel  computer,  would 
be  proh ib i t ive . ) 

For  the  solution  of  a linear  system  Ax  = v we  take  (Z^,v^)  = (A,v), 

(Z^+^,  = (21  ■ X^Y^)  (Z^,v^)  , with  the  sequence  ending  in  or  approach- 

★ ★ * 1 * 
ing  (Z  ,v  ) , where  Z is  block  diagonal.  Then  x = Z v . One  necessary 

condition  for  consistency  is  that  21  - X^Y^  = I + ri(X^)  be  nonsingular, 

which  holds  if  and  only  if  p(r^(X^)>  < 1.  Since  r^X^)  and  s^(X^)  3 I - Y^X^ 

are  similar  we  could  also  assume  p(s^(X^)>  < 1.  Using  earlier  notation, 

ri(Xi>  3 C[X^],  s^(X^)  3 B[X^];  the  notation  has  been  changed  to  provide 

greater  generality.  In  the  context  of  an  iteration  converging  to  a block 

diagonal  matrix,  the  condition  ||  B[A]||  < 1 (implying  ||  s^X^)  ||  ^ ||s^(Z^)  ||  < 1) 

states  that  A is  sufficiently  close  to  a particular  block  diagonal  matrix, 


namely  D[A]. 


67 


We  first  generalize  earlier  results  on  linear  methods  C ||s ^ (Z J j < 1 =» 
||  si+i^Zi+x^  H 5 II  s^(Z^)  II  ^ and  t,len  determine  conditions  under  which  we 
obtain  a quadratic  method  (||  ||  < 1 =»  ||  si+1(Zt+1>  ||  £ ||  si(Zi)  ||2)  . 


Theorem  5 . 1 In  the  sequence  Zi+1  = F^.Y^Z^  , suppose  that 

{CP ,P) | 1 ^ P ^ n}  C Tt  C f(p ,q) | 1 ^ p,q  * N}  = U, 

X.  = L E Z.E  , Y.  3 D[Z.  ]_1, 
i.  i pio  i L i J J 
i 

si(M)  = I - YJ1. 

Then  ||  s^CZ^)  j|  < 1 implies 

II  H * H II  * H H - 


Proof.  Suppose  ||  s^Z^  |j  < 1.  Define  Q by  Zi+^  = Z^  - DfZ^Q,  so 
Q = -Y.Z.  + Y.X.Y.Z.  = -s.(X.)  + s.(X.)s.(Z.).  Since  D[s. (X.)  ] = 0, 

S = D[Q  ] = D[s1(X1)st(Z1)].  Since  s.(X.)  = ^ . EpS  .(Z.)  Eq , ||  s.(X.)  ||  * ||s.(Z.)  || 
and  ||  S ||  £ ||  si(Zi)  ||2  < 1.  Thus  D[Zi+1]  = DCZ±]  (I  - S)  , Y.+1  = (I  - S)"^., 
and  s^CZ^^)  = (I  - S)s.+^(Zi+^)  + s*  Lemma  2.2  applies,  so  that  ||s^(Z^  .)  ||  < 1 
implies  ||  s.+1(Z1+1)||  s ||  st(Zi+1)||.  Bue  s.(Z.+1)  - S.CZj)  + Q - 

si<Zl>  * si«i>  + si<Xi)si<Zi>  ‘ WiC,*,  + ^.Vi'Wi'V’ 


r = U - T^  so 


+ o.(L  E s.  (Z.)E  ) 
j TL  p iv  l q' 

^ Dj(si(Zi)) . 


Si(zi) 


Thus  ||  s.(Z.+1)  ||  £ ||  s.(Z.)  ||  < 1. 


QED. 
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To  see  when  the  method  is  actually  quadratic,  we  first  establish  that 

2 

at  least  some  blocks  of  s^+^(Z^+^)  are  bounded  in  norm  by  ||  s^(Z^)  ||  ■ 


Corollary  5.1.  With  the  hypotheses  of  Theorem  5.1, 


L E s.  ,(Z.,  . ) E I £ I s,(Z.) 
T.  p l+l  i+l  q 1 1 i i 


Proof.  From 


(5.1)  (I  - S)  s.+1(Z.  + 1)  + S = s.(Z.)  - s.(X.)  + s.CX^s.CZ.), 


S = D[si(Xt)s.(Zi)], 


and  E (I-  S)  = (I-  S)E  , we  obtain 
P P 


(I  - S)  L E s . , . (Z.  .)E  + S 

T.  p l+l  1+1  q 


\EP  si(zi,E,  - \ep  Sl<Xl>E« 

l+  +.EP  WdW 


But  .i(Z1)Eq  - s.(X.)  - 31(Xi)Eq, 

so  (I  - S)  Sj  Ep  s1+1(Zl+1)Eq  ■ Sj  Ep  s.tx^  sl(Z1)Eq  - S.  By  Lemma  2.1, 


Vp  S1S-1(Z1+1)E,II  S 'I  Vp  WS1(Z1)EJ  S II  sl<Xl)Sl<Zl)H  S 1K<V 


In  certain  cases  Corollary  5.1  provides  no  useful  information,  since 

we  could  have  2LE  s.  ,(Z.  .)E  =0.  Block  elimination  is  one  example 

i ^ p l+l  l+l  q 

where  this  occurs.  However,  it  is  seen  that  if  we  are  to  have  |js^+^(Z^+^)  || 

2 

£ ||  s.(Z.)  ||  then  we  must  have 

(5.2)  llEj.Ep  st+1«1+1)Eq||  5 II  II2. 

This  is  certainly  true  if  T^  ia  the  null  set,  in  which  case  we  are  generating 


69 


the  sequence,  so  we  shall  assume  that  is  non-empty  and  show  that 
(5.2)  will  not  hold  in  general. 

We  set  one  final  condition  on  T\  for  notational  purposes,  though  it 

turns  out  to  have  practical  consequences:  if  it  is  true  that  E Z E = 0 

p l q 

for  any  starting  matrix  Z^,  then  (p,q)  must  be  in  T..  For  example,  we 
now  write  the  Gauss- Jordan  algorithm  as 


A.  . = F(D[A.]  + (L[A.]  + U[A.])(Ij=1Ej) , D(A.]_1,  A.). 


i+1 


Inclusion  in  of  all  blocks  in  Z^  known  to  be  zero  clearly  will  not 
affect  the  sequence  Z.  , = F(X.,Y.,Z.).  With  the  additional  condition 

i+l  iii 

on  T.,  it  follows  that  if  (p,q)  £ T.  then  E Z.E  can  be  zero  only  bv 

i i p i q j j 

accident,  and  not  by  design. 


Corollary  5.2.  With  the  hypotheses  of  Theorem  5.1,  only  the  choice  = 6 
yields  a quadratic  method  for  arbitrary  starting  matrices  Z^. 

Proof . Suppose  that  T1  ^ 6.  Following  the  proof  of  Corollary  5.1,  we 


obtain 


(I  - s>  Vp  si+i<zi+i)E, 


?L,E  s . (Z . ) E + IL,E  s.(X.)s.(Z.)E 
p iv  l q p i l l l q 


which  implies,  by  Lemma  2.1, 


VP  =i+i<zi+i)Eql 

Er;Epsi(zi)Eq  + s + sr;Epsi(xi,si(zi)EP' 

1 1 r r 

The  presence  of  nonzero  1st  order  terms  (2^,, EpS ^ (Z^)E^)  in  general  dominates 

the  2nd  order  terms  (S  + £_,E  s . (X.)s . (Z.)E  ),  and  we  cannot  conclude  that 

T'.  p i i i l q 


(5.2)  will  hold  for  arbitrary  Z^  satisfying  ||  s ^ (Z ^)  | j < 1. 


QED 
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Now  let  us  consider  what  happens  when  block  fill-in  occurs,  as  it 

will  during  the  odd-even  algorithms  for  block  tridiagonals.  For  the  off- 

diagonal  (p,q)  block,  p * q,  we  have  E Z.E  = 0 but  E Z.  ,E  ^ 0.  Since 

p l q p 1+1  q 

Zi+1  = 2Zi  - Wi’  EpZi+lEq  = EpXi('YiZi)Eq  = EpXi  Si<Zi)Eq’  and 

II  EpZi+lEqH  * H EpXiH  II  Si(Zi)EqH-  From  <5a)  and  EpSEq  = °»  we  have 

(I  - S)  Ep  Si+l(Zi+l)Eq  = Ep  Si(Xi}  Si(Zi)Eq*  WhiCh  imPUeS  II  Ep  Si+l(Zi+l>Eql 


£ S + E s . (X.)  s . (Z.)E  s 
11  p l l 11  q 


Si(Zi) 


(We  cannot  conclude  that 


II  Ep  Si+l(Zi+l)EqH  * II  Si+l(Zi+l)ll^  however-> 

As  shown  by  these  inequalities,  fill-in  produces  small  off-diagonal 
blocks  of  Zi+1  when  ||  s^(Z^)  ||  < 1,  and  the  corresponding  blocks  of 
are  quadratically  small  with  respect  to  s^Z^).  Thus  one  source  of  quadra- 
tic convergence  is  from  block  fill-in  during  the  process  Z.  , = F(X.,Y.,Z.). 

l+l  v i i l 

In  summary,  we  now  have  two  explanations  for  quadratic  convergence  in 
the  odd-even  methods  when  applied  to  block  tridiagonal  matrices.  The  first 
is  that  the  method  is  a variation  of  the  quadratic  Newton  iteration  for 
A \ The  second  is  that,  in  going  from  to  H^_^,  all  the  non-zero  off- 
diagonal  blocks  of  are  eliminated  and  all  the  non-zero  off-diagonal 
blocks  of  H.  , arise  from  block  fill-in.  Each  block  of  BTH.  is  there- 
fore  either  zero  or  quadratic  in  3[H^]. 
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6.  HIGHER-ORDER  METHODS 

For  Poisson's  equation,  where  A = (-1,  b,  -I),  the  choice  N = 2™  - 1 
leads  to  considerable  simplification  in  the  Buneman  odd-even  reduction 
algorithm.  The  most  important  properties  are  A^  = (-1,  b^\  -I) 
(constant  block  diagonals)  and  the  expression  of  as  a polynomial  in 

b.  However,  in  some  applications  other  choices  of  N are  much  more  natural. 
Sweet  [S10]  therefore  described  a generalization  of  odd-even  reduction 
which  could  handle  these  cases  and  still  preserve  constant  block  diagonals 
and  a polynomial  representation.  Bank's  generalized  marching  algorithm 
[B5]  for  Poisson's  equation  uses  one  step  of  (very  nearly  the  same)  gen- 
eralized reduction  in  order  to  control  the  stability  of  a fast  marching 
algorithm.  This  technique  is  analagous  to  multiple  shooting  methods  for 
ordinary  differential  equations. 

In  this  section  we  analyze  Sweet's  method  as  yet  another  variation 
of  odd-even  reduction,  and  show  that  it  possesses  higher-order  convergence 
properties.  The  algorithm  actually  considered  here  yields  Sweet's  method 
through  a suitable  block  diagonal  scaling,  and  hence  our  results  carry 
through  much  as  before.  It  must  be  emphasized,  though,  that  the  methods 
of  this  section  are  not  intended  as  practical  methods  for  general  block 
tridiagonal  systems.  Rather,  our  purpose  is  to  elaborate  on  the  quadra- 
tic properties  of  odd-even  reduction  by  showing  that  they  are  special 
cases  of  convergence  properties  for  a family  of  methods. 

6. A.  p-Fold  Reduction 

The  idea  behind  the  generalization  of  odd-even  reduction  is  to  take 


a block  tridiagonal  linear  system  involving  the  unknowns  x^,x9,..., 


and 
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to  produce  a new,  smaller  block  tridiagonal  system  involving  only  the 
unknowns  x ,x„  . Once  this  system  is  solved,  the  knowns  x .... 

p 2p  J p’ 

are  back- substituted  into  the  original  equations  to  compute  the  remaining 
unknowns.  The  entire  process  will  be  called  p-fold  reduction;  odd-even 
reduction  is  the  case  p = 2. 

Let  p S 2 be  a given  integer,  and  define  = 'Jj/ pj . Starting  with 

(2)  (2)  (2) 

Ax  = v we  generate  the  reduced  system  A x = w , where 

A(2)  = (aj2)>  bj2)>  cj2))x2  ' x<2)  = <xpj>N  > and  the  diagonal  block  b^2) 

has  dimension  n..  For  1 s k s L + 1,  define 
PJ  2 


bkp-p+l  ckp-p+l 

akp-p+2  ^kp-p+2  Ckp-p+2 


akp- 1 kp- 


When  kp- 1 > N,  either  extend  Ax  = v with  trivial  equations  or  let  be 
redefined  as  a smaller  matrix  using  the  available  blocks  of  A.  Next,  re- 
partition A to  have  diagonal  blocks 


V V V b2p bN2p*  \+i 


and  call  the  new  system  Ax  = v.  Three  adjacent  block  rows  of  K are  thus 


(0 


The  p-fold  reduction  is  obtained  by  performing  a 2-fold  reduction  on  A, 
and  the  back  substitution  is  obtained  by  solving  the  system 


By  taking 

\p-tH-l  'kp-cH-l  *kp-p+i\ 


v\p-  1 


kp-1 


i-  V 


"kp-  1 / 


kp- p+1 

I 0 


kp-p+l' 


c v 

kp-  1 kp-  1 


the  reduction  step  simp  lies  to 


(6.1) 


■T 

f’ 

, (2) 

'k 

42)  = \p  ‘ akp  Vi  ‘ "kp  Vfl’ 


akp  °kp-  1 

bkp  akp  "kp-1  Ckp  \p+l 


Ckp  ^kp+l 


and  the  back  substitution  becomes 
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*kp-j  °kp-j  °kp-j  *kp-p  " Vkp-j  ^kp’ 

1 * j * P-1. 

Rather  than  computing  the  entire  a and  y vectors,  a more  efficient 
technique  is  to  use  LU  and  UL  processes  on  the  A blocks: 

for  each  k = 1,2,...,N2  + 1, 

(LU  process  on  A^) 

^k, 1 = akp-p+l ’ ^k.l  = bkp- p+1 ’ 

^,1  Vkp-pH-l’ 
for  j = 2 .... ,p- 1 

solve  tAk, j- 1 Ck,j-1  Fk, j- 1 ^ 

'^k,j-l  Ckp-p+j-  1 \,j-l^ 

\,j  ^P-P+j  Ak,j-1 

"k.j  bkp-fH-j  akp-p+j  Ck,j-1 

■\,j  Vkp-fH-j  akp-p+j  Fk,j-1 

SOlve  ^.p-l^p-l  \p-l  \p-ll 

'°Tc,p-l  Ckp-1  "^k.p-l" 
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for  each  k = 1,2,...,^, 

(UL  process  on  A^+^) 

3k+l,-l  " bkp+p-i;  Vk+1,-1  = Ckp+p-l’ 

^+1,-1  Vk]>4-p-l 

for  j = 2, . .. ,p-l 

solve  ®k+l- j+l  °k+l,-j+l  Fk+l,-j+ll 

r akp+p-  j+1  vk+l j+1  ’\+l j+1  ^ 

B = b — c A 

k+ 1 , - j kp+p-j  kp+p-j  H<+l,-j+l 


y 


Ckp+p-j  Ck+l,-j+l 


k+l,-j 

co.  — v — c F 

_k+l,-j  kp+p-j  kp+p-j  k+l,-j+l 

solve  ^kp+l  r\pH-l^ 

^•^.p+l  Vk+l,-p+l  C°k+ 1 , - p+ 1 - 

The  block  tridiagonal  system  A^x^  = w^  is  now  generated  by  (6.1)  and 
solved  by  some  procedure,  and  we  take  advantage  of  the  earlier  LU  process 
for  in  the  back  substitution: 

for  each  k = 1,2 N2  + 1, 

^kp-l  ^kp-1  \p-l  *kp-p  " Vkp-1  \p 
for  j = p-2 1 

’'kp-jH-j  Fk,j  \,j  *kp-p  " Ck,j  *kp-p+j+r 


6 . B . Convergence  Rates 

As  observed  in  Section  2,  if  j)  B[A]||<  1 then,  under  the  K partition- 
ing for  p-fold  reduction,  we  have  ||  B[£]||  £ ||  B[A]||  . By  Corollary  4.1 
it  follows  that  p-fold  reduction  will  yield  ||  A^  ||  £ ||  ff||  = ||a||  and 
II  B[A(  ^ ] ||  s ||  B[£]^  ||  s ||  B[A]||^.  In  fact,  we  can  provide  pth  order 


L 
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convergence,  although  our  first  result  is  somewhat  weaker  than  might  be 
expected  in  comparison  to  earlier  results. 

We  will  actually  consider  a natural  generalization  of  odd-even  elim- 
ination. Let  Mp  be  a nonsingular  block  (2 p— l)-diagonal  matrix, 


M ( m , _,...,  m m . , ) . 

P J j'P+1  J,0’  j,p-l  N 


We  want  to  pick  M such  that 
P 


M A = (r . , 0,  ...,  0,  s.,  0,  . . . ,0,  t) 

p J j j 


C2) 

MpA  is  a block  (2p+l) -diagonal  matrix;  A is  constructed  by  selecting 
the  intersections  of  the  block  rows  p,2p,...,  and  block  columns  p,2p,... 


Thus  we  have 


(2)  , (2) 

a = r , b x = s 
j JP  j jP 

c<2)  = t.  , W<2)  - (M  v).  . 
J JP  J P JP 


Now,  the  jth  block  row  of  M A is 

P 


~^k=-  p+1 


m.  [block  row  (j+k)  of  A], 
j > k 


so  that 


rj  = mj  ,-P+l 

sj  = mj,-icj-i  + mj,obj  + raj,+i  Vi 

ti  = mj,p-i  cj+P-i' 


We  require  that 


mf,  , c . , , + m,  , b , . + m,  , , a,  , , - 0, 

j,k-l  j+k-1  j,k  j+k  j ,k+l  j+k+1 

I k | = 1 >2 , . . . ,p- 1 , 
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where  we  take  m.  = 0,  m.  =0. 

J.-P  J.P 

The  special  case  p = 2,  = (-a^b^  ^,1,-c^b^^),  has  been  considered 

in  Section  4. A.  For  p > 2,  define 


d.  = b. 

J.-P+l  J-p+1 


dj,-k 


‘‘j.-k 


j,k 


Vj,k 


- b . . - a.,  d . ^ , ,c.,  ,, 

j-k  j-k  j ,-k- 1 j-k- 1 

k = p-2,...,l 
aj~k  d j , - k- 1 ’ 


k = p-2, ... ,0 


= b. 


j.P-1  j+P-1 


— |)  « p rj  ra 

j+k  j+k  j , k+ 1 j+k+ 1 ’ 
k = p-2 ) t * « i If 

= -c  d_1 

j+k  j,k+r 

k = p-2 0 . 


Then  we  mav  cake 


j,-k  ^j.O  “j.-l  “j.-k+l 


a . 


1.0 


m.,  =v.  .v.  ^ ...  v , 

J »k  j,0  j,l  j , k- 1 ’ 


k = 1 ,2  , . . . , p-  1 . 


The  d , 's  are  derived  by  an  LU  factorization  process,  and  the  d 's 
J>“K-  j >+k 

by  a UL  process.  It  follows  from  earlier  results  that  if  ||  C[A]||^<  1 
then  [j  m^  s ||  , so  the  diagonals  of  decrease  in  size  as  one 

moves  away  from  the  main  diagonal. 

Our  first  theorem  partially  generalizes  Theorem  4.1. 
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Theorem  6.1.  Let  J = (-s.^r.,  0,  ....  0,  -s.^t.),  B = (-b.^a.,  0,  -b.^c,), 

JJ  J J J J J J 

B =■  L + U,  J = JL  + Jy,  II  L ||  S a,  II  U ||  ^ a,  ||  B ||  s 2m  = 8,  with  * s 


Then  ||  J||  * 8?,  ||jl||  s , ||  JyH  * ±3>P . 

Proof.  Let  RQ  = 0,  ^ = L(I  - R^'V  TQ  = 0,  Tfc  = U(I  - T^)'^, 

k=l,...,p-l,  S=R  i+T  , , so  that 

p-  1 p-  1 

(I  - S)JL  = L (I  - Rp_2)"1L(I  - Rp_3)"1L  ...  L(I  - R0)-1L, 

(I  - S)JIT  = U(I  - T „)'1U(I  - T ,)-1U  ...  U(I  - T.)-1U. 

U p-Z  p- j 0 

2 - 1 

Define  '{r  = 0 > t,  = a (1  - i,  .)  , so 

0 k k- 1 


Kii*  V llTklls  V 

||JLI|,  II  Jyll^oP/td-  2Vl)7^  (1-  V], 
* 0p/[2p_I(l  - 2*  ^Ijd-  tk)]. 


Define  50  = °’  51  = l’  5k+l  = 5k  ' * 6k-l’  30  6k  = det(a'1>a()(k-l)x(k-l)>  °* 

Vk  = 3 V 5k+l’  1 ' *k  = 'k+2  'k+1’  (1  " 2ip-l)  “k=0(1  " V 


(1  - 2 i ) 5 = 5 


Define  T = 5.  l1  - a 5 so  t = 1, 


p-1  p p+1  p-1  k k+1  k-1  L 

T2  ~ 1 - 2 or*" , ~k  = Tk-1  - a2Tk_9.  By  induction,  Tk(-)  = 2 k+1, 

d or  = -2nk6k  ^ £ 0.  3ut  then,  for  m S 4,  ~k(»)  2 Tk(-|)  , so  2P  («»)  2 1. 


Thus  ||  J||  s SP  and  ||  jJ|  , ||  jJ|  * j3P. 


QED. 


Corollary  6.1.  If  ||B[A]||<1,  ||l[B[A]]||,  ||  U[B[A  j ] ||  * \ ||  B[A]  ||  , then 
||  B[A(2)]||  * ||  B[A]  ||  P. 

In  generalization  of  Theorem  4.4,  there  is  the  very  surprising 


Theorem  6.2.  With  X(l)  = X(A(l)),  X(1)  SI,  y.(l^  = (l-./l-  X(l) )/ (1+.  4- \(i)>  , 


we  have 
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\(i+D  ^ X(i)PfJ_±-i£ 


Proof . It  suffices  to  consider  i = 1,  so  we  xjill  drop  the  superscripts. 
Define  e.  = b ^ d.  , T] . = b . ^ d , k = 0,...,p-l,  so  that 

J,-k  j-k  j ,-k  j > k J+*^  J ,k 


= (-a,  a5 


j j,-l'v  j-l  “j-l  3,-2 

), 


.-1  . , ,-l  -1  . 

j- P+2  a j-p+2  '’j.-p+l 


X (b"1  . a.  ,, 

J-p+1  j-p+1 


(2)  = r.  , 

a • JP 

3 

t.  = (-c  . 71 . 1 .)  (-b  . 1 c 

J J J,1  J+l  J+l  3,2 

X (b- 1 , a . ,) , 

v j+p- 1 j+p-  1 


,-1  -.-1 
j+P- 2 C j+p-2  j,p-l 


c(2)  - t.  , 

J JP 

h(2)  - « 
b , ■“  s . j 

j jp 

v - 1 , -1  1 , - 1 

s . = b . - a.  e . b.  , c.  , - c.  T.  b a 

J J J J.-l  J-l  J-l  J J.l  J+l  J+l 

= b . (I  - a.)  . 

J J 


Now 


’ 11  ^p-i11  3 1#  ii  7ij!kii s (i  ■ xit  ^jlk+iii /A)"1-  By  definins  ei  = i. 

Ek  = (1  • XEk-l/4)"1>  We  haVe  II  ^Jp-kH  * V Similarly,  ||  * Efc. 

This  yields  ||  c ||  * XE^/2,  ||  (I  - CTj)-1||  £ 2/(2  - \E  ^ . To  bound  \(2) 
observe  that 

4||b<2>-1a<2)ll  II  bP)-1c(2>  " 


4 I)  (I  - a.  ) 
JP 


j-l  j-l" 

'ln  " \1  a.  II  II  e:1  J|  ||  b}  , a. 

jp  JP  1 1 JP.-l  1 JP-1  JP 


x [l  ejP.-2,!  •••  ” 'jJ.-pU11  11 'Wl  “jp-P+11 


x 11  (1  • w 


-1 


--1 


X I 71 


-1 

JP-P.2  1 


b”  1 c . 

JP-P  JP-P 

JP-P»P“  1 II  II  b JP-  1 C JP-  1 1 


-1 


jp-P.l"  11  jp-P+1  jp-P+1 


s 4 


( — 2-A  (hp  nr;E2. 

V2  - XE  J 4 j-l  j 

x p- 1 
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(2) 

Thus  is  bounded  by  the  above  quantity.  Suppose  first  that  X - 1, 

which  implies  that  y,  = 1.  Then  E^  = 2n  ' (n+1)  , the  bound  simplifies  to 

XP,  and  we  are  done  with  this  case.  Now  suppose  that  X < 1.  It  may  be 

shown  by  induction  that  E^  3 (1  + u.)(l  - u.n)  (1  - y,n+^)  , yielding 

(2/(2  - XE  ))~  = (1  - X)  ^((1  - ilP)  (1  + u,P))"">  and  with  a little  effort 
P”  i 

the  bound  simplifies  to  the  desired  result.  QED 

Note  that  we  have  not  actually  shown  that  X^  £ 1 implies  ■£  1. 

This  point  is  now  to  be  resolved. 


Corollary  6.2. 
X^  < 1,  and 


If 


(i+1) 

u 


s:  1 then  \(i+1) 
(i)  p 


, (i)P  . , 

£ \ with  strict 


inequality  if 


Proof . The  case  X^  = 1 was  considered  in  Theorem  6.2.  Suppose  that 
0 \ < 1,  again  considering  only  the  case  i = 1.  It  is  not  hard  to  show 

1 Lp/2J  i 

that  ((1  + n),/2>P(2/(l  + yP)>  = (7.  (X))"  , where  n (X)  = (P.)(l  ' • 

p P ~i=0 

Now,  II  (1)  = 1,  7'  (X)  < 0 for  0 £ \ < 1,  so  n (X)  > 1 for  0 ^ \ < 1 and 
P P P 

X^“^  < XP  for  this  case.  Assuming  only  \ £ 1,  and  since  \(1  + y,)-/4  = u, 
we  have  X^  S y,p(2/(l  + y,P))%  which  implies  y/^  ^ y,P . QED 


6.C . Back  Substitution 

Error  propagation  in  the  back  substitution  for  p-fold  reduction  de- 
pends on  the  particular  implementation.  In  the  simplest  form 


*kp-j  ^kp-  j \p-j  *kp-p  ^kp-j  Xkp  ’ 


errors  in 


*kp-  j 


. depend  on  newly  generated  rounding  errors  and  propagated 


errors  from  x,  and  x,  . We  can  write  the  computed  solution  as 
kp-p  kp 

y,  , = x,  . + %.  , to  find 

ykp-j  Kp- j kp-  j 
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'kp-j  ^p-j  'kp-p  'kp-j  ^kp  + ekp-j" 

It  follows  from  the  analysis  of  the  Gauss- Jordan  algorithm  (Corollary  3.5) 
and  from  explicit  formulae  for  \p_j>  j>  that  ||  l|  £ ||  B[A]|f  J, 

II  \p-jH  S II  B£A]|P  if  II  BfAlll<  which  leads  to 

II  Skp-jH  * H »[AiirJll  Vp«+  II alt’ll  5kPl,+  II  V-jll- 

Errors  in  a newly  computed  component  therefore  depend  most  strongly  on 
errors  in  its  closest  back-substituted  component,  and  the  "influence" 
decreases  as  the  distance  between  components  increases.  Moreover,  we 
have 

II  §kp-jl(  s II  B[A]H  VpH’  II  Spl^ + II  V-jll* 

If  e is  a bound  on  errors  introduced  at  each  stage,  it  follows  that  errors 

in  the  final  computed  result  will  be  bounded  by  e lo?  N. 

P 

Slightly  different  results  prevail  in  the  back  substitution 


^p-l  ^kp- 1 \p-l  ^D-p  ’Vp-l  ^p 
for  j = 2 , . . . , p- 1 

‘kp-j  Fk,p-j  Ak,p-j  Xkp-p  Ck,p-j  ’Scp-j+l 


Errors  in  the 


computed  solution 


' kp-j 


*kp-J  + "kp-j  obey  the  rule 


5?  pm 

skp-l  'kp-l  "kp-p  " Vkp-p  "kp  + Skp-1 

for  j = 2 , . . . , p-  1 

"kp-j  " -Ak,p- j "kp-p  " Ck,p-j  ^kp-j+1  + €kp-j’ 

so  the  errors  in  x depend  on  propagation  from  x.  and  x,  , , rather 
KP  J xp-p  kp-j+1 

than  x^p.p  and  ’Stp’  This  iroplies  a linear  growth  as  in  the  LU  factorization 
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in  conjunction  with  logarithmic  growth  as  in  odd-even  reduction.  If  we 
again  suppose  that  ||  e^^jj  <:  e,  with  ||  ?kp_p  |(,  ||  ckp||  * ?,  it  follows 
that 

I!  ?Up_ ! II  * II  B[£]||  ? + e, 

II  Skp.jH  * ||  B[Xj  ||  max(5,  ||  lkp_j+1||  ) + e 

which  leads  to  ||  ?kp_j||  s % + (p-l)e.  The  end  result  is  that  errors  in 
the  final  computed  result  will  be  bounded  by  ep  log  N. 
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7.  SEMIDIRECT  METHODS 

Briefly,  a semidirect  method  is  a direct  method  that  is  stopped  be- 
fore completion.  In  common  usage,  a direct  method  for  the  solution  of 
Ax  - v computes  x exactly  in  a finite  number  of  steps  if  exact  arithmetic 

is  used,  while  an  iterative  method  computes  a sequence  of  approximations 

( i)  * 

x converging  to  x.  The  semidirect  methods  are  intermediate  on  this 

scale,  producing  an  approximation  to  x in  a finite  number  of  steps, 

while  an  additional  finite  number  of  steps  could  produce  x exactly.  During 

the  computation,  intermediate  results  are  generated  which  approximate  the 

solution.  If  the  error  is  small  enough  then  the  algorithm  is  terminated 

prematurely  and  an  approximate  solution  is  returned. 

The  method  of  conjugate  gradients  is  an  example  of  a semidirect 
method  since  it  can  produce  a good  approximation  long  before  its  usual 
termination  [R2].  The  accelerated  Parallel  Gauss  iteration  ([H4]  and 
Section  8.E)  is  another  example,  as  is  Malcolm  and  Palmer's  convergent  LU 
factorization  [M2]  of  A = (a,b,a) , a and  b positive  definite  with  o (b  *a)  < -7. 

Here  we  consider  semidirect  methods  based  on  the  quadratic  conver- 
gence properties  of  the  odd-even  methods.  This  will  generalize  various 
results  obtained  by  Hockney  [H5],  Stone  [S2]  and  Jordan  [Jl].  Buzbee  [B17] 
discusses  a similar  technique  based  on  the  Buneman  algorithm  for  Poisson's 
equation  (cf.  Theorem  4.3). 


The  name  semi- iterative  has  already  been  appropriated  for  other  purposes 
[V3],  Stone  [S2]  initiated  the  name  semidirect. 
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7 .A.  Incomplete  Elimination 

Let  us  first  consider  odd-even  elimination  for  Ax  = v,  which  is  the 
process  H1  = A,  v(1)  = v,  H = (I  + C[H.])H. , v(l+1)=  (I  + C[H.])v(l'1 . 
In  Theorem  4 . 1 we  proved  that  if  |j  B[A]||<  1 then  ||  B[Hi+^]||  ^ ||  B[H^]“  |j, 
so  that  the  off-diagonal  blocks  decrease  quadratical ly  in  magnitude  rela- 
tive to  the  diagonal  blocks.  This  motivates  the  approximate  solution 
z^^  = D[1L]  ^v^\  Because  we  halt  the  process  before  completion,  we 
call  this  technique  incomplete  odd-even  elimination. 


Theorem  7.1. 


x - z*-1'1  II  / II  X | 


B[H.]||. 


Proof . Since  v^  = H.x  = D[H^](I  - B[H^])x,  we  have  z^ 


x - B[H.]x. 


D[H.]_1v(i) 

QED 


Thus  the  B matrix  determines  the  relative  error  in  the  simple  approxi- 
mation z ^ ^ . If  3 = ||  B[A  ] ||  < 1,  0 < e < 1 , m = ! log  N”  , then 


(7.1)  k=  1 + max(0,  min(m. 


log 


l°g9  e 
2 V log  ? 


)) 


guarantees  |j  B[R^]||  £ e.  For  example,  if 


a = — 


f20  = 10-6,  then 


k = 6 and  we  should  have  N > 32  for  incomplete  odd-even  elimination  to  be 

2^  — 32  — o ^ 

applied.  In  fact,  |j  B[Vl^]  ||  £ 3 =2  = 10  ”,  so  the  results  of  the 

incomplete  elimination  would  be  much  better  than  required. 

In  the  case  n =»  1,  a^  = c^  = a,  b^  - b,  we  have  ||  B[H^+^]||  * 

II  B[Hi ]2  }|  / (2  - ||  B[H.]2||  ).  Since  x2/2  < x2 / (2  - x2)  < x2  for  0 < x < 1, 


k must  be  larger  than 
(7.2)  1 + log 


log2(e/2)' 
2l  log2(3  2) 
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in  order  to  have  ||  B[H,  ]||  £ e.  (By  induction,  3 > 2(9/2}  , and  if 

2k_  ^ 

2(3/2)  > e then  3 > e.)  A more  refined  lower  bound  is  obtained 

2 2 

by  defining  t = 3,  t . ~ T (2  - t ) , and  computing  the  largest  k such 
i rtf-  in  n 


(7.3)  xR  > e . 


Since  B[1L  ] is  the  matrix  associated  with  the  block  Jacobi  iteration  for 
H^x  = v^  , when  ||  B [ H^. ] | j is  small  we  can  use  this  iteration  to  improve 
the  estimate  . pn  fact,  the  choice  z^  = of h ] * is  the  iterate 

- iC 

following  the  initial  choice  z ^ = 0.  If  z^k’^  is  an  approximation  to 

x and  z(k»1-+1)  = B[Hk]z(k,l)  + D[Hk]’1v(k)  then  ||  x - z(k’^||  <: 

||  B[H,  ]^||  ||  x - z(k’°^  ||  . k and  l may  now  be  chosen  to  minimize  computa- 

2^  1 £ 

tion  time  subject  to  the  constraint  3 £ e.  Any  of  the  other  standard 

iterative  methods  may  be  used  in  place  of  the  Jacobi  iteration;  for  related 
results  using  the  theory  of  non-negative  matrices  and  regular  splittings, 


see  [HI], 


We  have  remarked  earlier  that  if  the  tridiagonal  matrix  A = (a.,  b.,  c.) 

j j J 


is  irreducibly  diagonally  dominant  and  9' 


B[H^]||  then  3^  £ 1 and 


^ 9<‘i'1 2 . This  is  a case  of  practical  importance  in  differential 
equations.  However,  the  example  A = (-1,  2,  -1),  for  which  9^  = 1, 

1 £ i ^ m-1,  3^  = 0,  shows  that  strict  inequality  is  necessary  for  incom- 
plete elimination  to  be  effective. 


Indeed,  when  3^  is  very  close  to  1,  as  will  be  the  case  for  most 
second-order  boundary  value  problems,  the  fact  that  we  have  strict  in- 


equality does  not  imply  that  3V  will  be  small  for  i < m.  With  regard 
to  the  particular  approximation  z^  = D[H^]  v^\  z ^ ^ depends  only  on 
v £,  j - (2i_1  - 1)  £ l s j + (2 1-1  + 1) . Thus  for  N = 2m  - 1,  i < m, 


86 


( i.) 

the  center  point  2^m  ^ will  be  independent  of  the  boundary  data.  Al- 
though this  is  not  the  only  possible  approximation,  complete  odd-even 
elimination  will  be  necessary  to  ensure  convergence  to  the  continuous 
solution . 

7 ,B . Incomplete  Reduction 

The  incomplete  odd-even  reduction  algorithm  is  analagous  to  incomplete 
odd-even  elimination.  Complete  reduction  generates  a sequence  of  block 

tridiagonal  systems  A^x^  = w , i = l,...,m,  m = ~log?  N+  1~  , solves 

, (m)  (m)  (m)  , , (m)  , , , „ . , 

Ax  - w exactly  for  x , and  back  substitutes  to  finally  obtain 

(1)  . , . , (k)  (k)  (k)  , 

x = x . Incomplete  reduction  stops  with  Ax  = w for  some  k be- 

Ck)  (k) 

tween  1 and  m,  solves  this  system  approximately  for  y *3  x , and  back 
substitutes  y^  to  obtain  y = y^  rs  x^  . 

We  remind  the  reader  of  observations  made  in  Section  4.D,  especially 
that  roughly  half  the  actual  work  is  performed  in  the  first  reduction,  one 
quarter  in  the  second,  and  so  forth.  Thus  the  savings  in  incomplete  reduc- 
tion would  be  much  less  than  those  in  incomplete  elimination,  where  the 
work  per  step  is  roughly  constant.  For  more  details,  see  Section  10. C. 

In  contrast  to  incomplete  elimination,  the  incomplete  reduction  must 
deal  with  error  propagation  in  the  back  substitution.  We  have  already 
shown  that  roundoff  errors  in  complete  reduction  are  well-behaved,  and 
can  only  grow  at  most  logarithmically.  But  since  the  approximation  errors 
quite  possibly  might  be  larger  than  the  roundoff  errors,  a different  analy- 
sis is  needed.  We  will  see  that  a number  of  interesting  properties  result, 
including  no  growth  of  errors  under  certain  assumptions  about  roundoff, 

* 

This  observation  is  due  to  an  anonymous  referee  of  [H2]. 
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and  a superconvergence- like  effect  due  to  damping  of  errors. 

An  initial  approximation  y^  may  be  computed  from  and  w^  ac- 

cording to  our  analysis  of  incomplete  elimination.  The  choice 

= D[A^^  ] lwOO  is  a simple  one,  and  we  have  -y^'*  [j  ^ |(B[A^^]| 

The  parameter  k should  be  determined  both  for  computational  economy  and 
for  a suitable  approximation.  Here  the  quadratic  decreases  of  B[A^  ] are 

important  and  useful  in  some  (but  certainly  not  all)  situations. 

(k) 

Regardless  of  the  source  of  the  approximation  yv  , let  us  suppose 

y 

(id-1) 


(k)  (k)  (k)  (i) 

that  y^  = x)  + In  the  back  substitution,  y is  computed  from 


by 

y<»  = y<i+1) 

y2j  Yj 

(i)  = (b(D  )-l(w(i)  . a(D  d+1)  . c(i)  yCi+D)  + e(i)  , 

y2j-l  C 2j-l;  (W2j-l  a2j-l  yj-l  C2j-1  yj  ; + 2J-1* 

with  representing  rounding  errors.  We  then  have 


A(i)  = (i+1) 

2j 

6(i)  = e (i) 

2j-l  2 j-  1 


/h(i)  x"l,,(i) 

( 2j-l}  a2  j- 1 >1 


+c<«  5<i+1)) 

2j-l  J 


Theorem  7.2.  If  j|  s(l)  |j  S (1  - ||  B[A(l)  ] ||  ) 


'<i+1>  II  then  |H(i)  ||  - ||6<1+1)|t 
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As  in  the  analysis  of  roundoff  error  propagation  for  complete  odd-even 
reduction,  we  have 

i(i-)  II  /II 


||  5^|l  ^ax(||  6^'  ||, 


s(l)||+  ||  B[A(l)]||  ||  5(l+1)  ||  ). 


From  this  expression  and  the  hypothesis  we  have  ||  6^  ||  £ ||  |j  . 

But,  also  by  construction,  j|  5^  ||  ^ ||  ||  . 


QED 


This  theorem  requires  some  explanation.  First  suppose  that  no  round- 


ing errors  are  committed,  so  that  = 0 and  all  the  error  in  y^  is 


.00 


derived  from  the  approximation  error  in  y . It  now  follows  from 
|j  B[ A ] ||  < 1 that  this  error  will  not  grow  during  the  back  substitution. 


From  | x 


00 


s ||  X ||  , we  conclude  that  ||x-y  ||  / ||x  ||  ^ -y  ^ ||  ' || 


In  fact,  it  is  easy  to  see  that  while  1 1 x — y||  = ||  x^  - y <'k'>  ||,  many 


vector  components  of  x-y  will  be  much  smaller  in  size  than  J|  x - y j j . 

In  Figure  7 . 1 we  illustrate  the  errors  by  component  for  the  tridiagonal 


system  (-1,  4,  -l)x  = v,  >1  = 31,  k = m-1  = 4,  v chosen  so  that  Xj  = 1, 


and  y^  = b^k^  “w^k^  . In  the  absence  of  rounding  errors,  the  back  sub- 


stitution quickly  damps  out  approximation  errors,  creating  a very  pleasing 
superconvergence- like  effect. 


Now  suppose  that  rounding  errors  are  committed  and  we  have  the  upper 


bound 


.(i) 


£ e.  If 


;(k) 


^ ve  for  some  moderate  constant  v then  we 


have  essentially  solved  the  system  A^x^  = w ^ exactly,  and  the  error 


propagation  analysis  is  the  same  as  for  complete  reduction.  That  is,  we 

(i) 


have  ||  ||  ^ (k  - i + v)e,  1 ^ i s k. 

;(k) 


The  remaining  case  is  when  ||  ||  » e.  Here  the  semidirect  method 

must  be  considered  in  competition  with  an  iterative  method.  The  hypoth- 
esis of  Theorem  7.2, 


6^  ||  (1  - ||  B[A^  ] ||  ) 5 e,  should  be  viewed  as 
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an  attempt  to  formalize  the  condition  j|  5^  ||  » e . As  long  as  the  hypoth- 
esis remains  true,  the  difference  (in  norm)  between  x^  and  y^  can  en- 
tirely be  attributed  to  the  original  approximation  error.  As  soon  as  the 
hypothesis  is  violated  it  is  possible  for  the  errors  to  grow,  but  we  will 
still  be  able  to  say  that  ||  x - y||  = ||  6^  ||  +•  0(ke)  . Although  the  super- 
convergence effect  would  be  moderated  somewhat  by  roundoff,  its  qualitative 
effects  will  still  be  felt  and  we  can  expect  this  bound  on  accumulated  er- 
ror to  be  an  overestimate. 


7 . C.  Applicability  of  the  Methods 

We  close  with  some  remarks  on  the  practical  application  of  the  qua- 
dratic semidirect  methods.  In  Buzbee's  analysis  [B17]  of  the  Buneman  algo- 
rithm for  the  differential  equation  (-A  + d)u  = f on  a rectangle  with 
sides  c ^ and  c , da  nonnegative  constant,  it  is  shown  that  if 


1 2 
T2  ^ 2 


+ 


_2 


] » 1 


then  a truncated  Buneman  algorithm  will  be  useful.  Geometrically  this  re- 
quires that  the  rectangle  be  long  and  thin  if  d is  small,  and  hence  the 
block  tridiagonal  matrix  will  have  small  blocks  when  properly  ordered. 

By  directly  considering  the  finite  difference  approximation  to  Poisson's 
equation  on  a rectangle  with  equal  mesh  spacing  in  both  directions  it  is 
possible  to  reach  a similar  conclusion;  this  also  serves  as  a simple  model 
for  more  general  elliptic  equations.  With  M nodes  in  one  direction  and  N 
in  the  other,  M < N,  the  matrix  A is  (-1^,  P^,  - 1^) ^ , P = (-1,  4,  -1)  . 

It  is  readily  verified  that  ||  B[A]j|=  2 | j QM 1 1 , = P^1, 
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\k 


l D + D 1 

_ rn  m-  i.v  « 

-(1 ),  n = 2m, 


2D 

j(l  - n = 2nrt-l, 

a 


»0'  l>  Dl'4>  Dn'4  Dn-1  “n-2 * 

The  following  table  gives  j|  B[A]||  for  M = 1(1)6,  along  with  estimates 

on  the  number  of  odd-even  reductions  needed  to  obtain  ||  B[A^  ]||  < 10 


k such  that  ]|  B[A^^]||<  10  ^ 


M 

1 

2 

3 

4 

5 

6 


B[A  ] j 


2 = -5 

! * -67 
| = .86 

If*- 
If*  - 
If*- 


lower  bounds 

(7.2)  (7.3) 


upper  bound 

(7.1) 


8 

9 

10 

11 


These  figures  again  lead  to  the  conclusion  that  the  best  use  of  the  semi- 
direct  methods  for  elliptic  equations  will  be  with  long  thin  regions.  In 
the  next  section  we  consider  some  iterative  methods  which  use  matrix  split 
tings  and  orderings  to  take  advantage  of  this  property. 
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8.  ITERATIVE  METHODS 

In  this  section  we  consider  the  iterative  solution  of  a finite  dif- 
ference approximation  to  a general  nonseparable  elliptic  equation.  Of 
particular  interest  will  be  methods  which  employ,  as  part  of  each  itera- 
tive step,  the  direct  or  semi-direct  solution  of  a block  tridiagonal 
linear  system.  Continuing  our  emphasis  on  methods  for  a parallel  computer, 
we  want  this  system  to  be  such  that  the  method  already  considered  can  be 
used  efficiently. 


8 .A.  Use  of  Elimination 

Let  us  first  note  that  several  of  our  theorems  about  block  elimination 
have  applications  to  iterative  methods.  Consider  the  system  Ax  = v and 
the  block  Jacobi  iteration  * B[A]x^  + D[A]  ^v.  Suppose  that 

|j  B [ A ] 1 1 < 1,  so  the  iteration  converges;  this  condition  is  satisfied  for 
many  elliptic  equations  with  suitable  partitioning  [V3],  If  the  structure 
of  A can  be  simplified  by  an  elimination  step,  forming  A'x  = v1 , then 
subsequent  computations  will  be  handled  more  efficiently,  and  since 
jj  B[A'  ]j|  s jj  B[A]||  the  iteration  will  not  converge  any  slower. 

Indeed,  convergence  of  the  block  Jacobi  iteration  will  often  be  quite 
slow.  When  A is  also  positive  definite  the  Jacobi  semi- iterative  method 
(J  - SI  or  Chebychev  acceleration)  [V3] 


x(l+1)  = vi+1(B[Alx('l)  4-  D[A]_1v  - x(l_1))+  x(l"lV 


vx  - 1,  v2  - 2/(2  - o ) , 

V.+1  = (1  - P2v1/4)“1,  o = o(B[A]), 


will  converge  at  a much  faster  rate  and  is  a natural  candidate  for  parallel 
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computation.  If  we  use  0 = ||  B[A]||  as  an  upper  bound  on  p and  use  the 
sequence  of  parameters 


* * , ? 
v1  = 1,  v2  = 2/(2  - 0Z), 

Vi  = (1 


2 * , - 1 
3 v./4)  \ 


then  the  J-SI  iteration  for  A'  will  not  be  slower  than  that  for  A.  By 
"preprocessing"  Ax  = v we  can  expect  that  the  actual  computation  time  will 
be  smaller;  unfortunately  a complete  proof  of  this  claim  is  not  at  hand. 

Juncosa  and  Mullikin  [J2]  discuss  the  effect  of  elimination  on  itera- 
tive methods  when  A = I - M,  M 2 0,  m = 0,  and  m SI  with  strict 

s 

inequality  for  at  least  one  value  of  r.  Their  particular  interest  for  el- 
liptic boundary  value  problems  is  to  order  the  boundary  nodes  followed  by 
the  interior  nodes,  and  then  to  eliminate  the  boundary  nodes.  This  would 
be  useful  for  domains  with  curved  or  irregular  boundaries.  It  is  shown 
that,  with  A'  = I - M'  representing  the  system  after  eliminating  one  point, 
p(M')  £ p (M) . Thus  the  point  Jacobi  iteration  will  not  be  slower.  Also, 
if  G = (I  - L[M])  ^ U[M]  is  the  matrix  describing  the  Gauss- Seidel  itera- 
tion, then  o(G')  ^ p(G).  Similar  results  follow  from  work  on  G-matrices 
[B4].  As  in  the  proof  of  Corollary  3.2  we  obtain  corresponding  results 
for  block  elimination,  though  not  under  the  more  general  assumption 


II  B[A]  ||  < 1. 


8 . B.  Splittings 

The  block  Jacobi  iteration  is  created  by  "splitting  off"  the  block 
diagonal  portion  of  A,  and  using  the  iteration  D[A]x^+^  = (D[A]-A)x^  + v. 
Many  other  iterative  methods  are  also  defined  by  a splitting  A = - M2 , 

MjX*'i+^  = + v.  When  is  a block  tridiagonal  matrix,  various  direct 

and  semidirect  methods  can  be  used  to  solve  the  system  for  x^ 
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A particularly  effective  procedure  discussed  by  Concus  and  Golub 
[C4]  is  readily  adapted  to  parallel  computation.  The  equation 

s -7  • (a(x,y)vu)  = f 

on  a rectangle  R,  a(x,y)  > £)  and  sufficiently  smooth , ^undergoes  a change 

~2  ~~2 

of  variable  w(x,y)  = a(x,y)  u(x,y)  and  scaling  by  a to  yield  the  equiv- 
alent equation 

- Aw  4-  p (x , y)  w = q 

.11  1 

~ 2 2 ~ 2 

on  R,  p(x,y)  - a A(a  ),  q = a f.  The  shifted  iteration  (-A  + k)w  , 

n-r  I 

= (k  - p)wn  + q,  k a constant,  is  solved  in  the  discrete  form 
(-A^  + KI)w^n+^  = (KI  - P)v(n)  + Q,  where  P is  a diagonal  matrix.  Since 
the  system  (-A^  + KI) b = c may  be  solved  by  the  fast  Poisson  methods  and 
Chebychev  acceleration  is  applicable,  the  overall  method  is  highly  paral- 
lel and  rapidly  convergent. 


8.C.  Multi- line  Orderings 

There  are  a number  of  more  classical  iterative  schemes  which  involve 
solution  of  block  tridiagonal  systems.  We  begin  with  the  multi-line  itera- 
tions for  a rectangular  grid;  the  ordering  of  unknowns  and  partitioning 
for  2- line  iteration  on  a 6x6  grid  is  shown  in  Figure  8.1.  A diagram  of 
the  resulting  matrix  is  shown  in  Figure  8.2. 


The  underlying  technique  of  the  multi- line  methods  is  to  divide  a 


square  or  long  and  broad  rectangle  into  a set  of  long  and  thin  rectangles. 
A discrete  form  of  the  elliptic  equation  is  then  solved  over  each  subrec- 


tangle as  part  of  the  iteration.  By  partitioning  the  grid  into  groups  of 
a few  lines  we  obtain  systems  that  can  be  solved  well  by  general  block 


■■■■ 
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methods,  particularly  the  semidirect  methods  of  Section  7. 

The  multi- line  block  Jacobi  iteration  with  Chebychev  acceleration 

is  probably  the  simplest  method  to  implement  on  a parallel  or  pipeline 

computer  and  should  be  considered  as  a benchmark  for  other  methods.  With 

an  nxn  grid  and  a k-line  iteration,  n = kn' , the  splitting  A = is 

such  that  consists  of  n'  decoupled  block  tridiagonal  matrices,  each  of 

which  has  kxk  blocks  with  N =*  n.  By  storing  zeros  in  strategic  locations 

may  be  regarded  for  computational  purposes  as  one  block  tridiagonal 

2 , 

matrix  with  kxk  blocks  and  N = nn'  = n ,/k.  Lambiotte  [LI]  points  out 
that  only  log  n odd-even  reductions  are  needed  because  of  this  construc- 
tion, rather  than  log  N reductions  as  for  the  general  case.  When  incom- 
plete odd-even  reduction  is  used,  the  number  of  lines  may  be  chosen  ac- 
cording to  the  guidelines  given  in  Section  7.C  for  Poisson's  equation. 

8 .D.  Spiral  Orderings 

Another  useful  ordering  of  unknowns  for  parallel  computation  is  the 

single  spiral  order  diagrammed  in  Figures  8.3  and  8.4.  The  spiral  order 

essentially  produces  a very  long  and  thin  rectangular  region,  which  is 

represented  by  the  tridiagonal  matrix  forming  the  central  portion  of  the 

matrix  in  Figure  8.4.  In  the  splitting  A =■  - M0  we  take  to  be  the 

2 2 

tridiagonal  center  of  A.  For  an  nxn  grid  is  n xn  and  has  ||  B[M^]|| 
about  thus  incomplete  odd-even  reduction  can  be  used  quite  efficiently 
to  solve  systems  involving  . 

We  note  that  the  single  spiral  order  has  been  used  with  another  split- 
ting to  yield  the  block  peripheral  method  [B9],  [E2].  Here  the  blocks  are 
of  the  periodic  tridiagonal  form,  and  the  convergence  rate  of  the  periph- 
eral block  SOR  method  is  shown  experimentally  to  be  similar  to  the  2-line 


block  SOR  method  [BIO], 
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The  double  spiral  ordering  (Figure  8.5,  8.6)  is  an  attempt  to  com- 
bine features  of  the  single  spiral  order  with  advantages  of  a 2-cyclic 

1 2 

ordering.  For  an  nXn  grid,  n odd,  one  spiral  contains  -(n  + 2n  - 1) 

1 2 

points  and  the  other  contains  — (n  - 2n  + 1)  points.  Multi- spiral  or 
-peripheral  orders  can  also  be  defined. 

For  the  single  and  double  spiral  orders,  when  n is  even  the  trailing 
submatrix  of  A looks  like 


The  *'d  elements  of  A should  be  included  in  the  implicit  part  of  the  itera- 
tion, M^ . This  can  be  done  at  little  extra  cost,  and  ensures  that  each 
row  and  column  of  has  at  most  two  non-zero  elements.  When  n is  odd  a 
similar  modification  should  be  made. 

The  mapping  from  the  natural  ordering  to  the  peripheral  or  single 

2 

spiral  ordering  is  defined  by  a permutation  of  the  vector  T = (l,2,...,n  ) 

2 

into  a new  vector  S = (S(l) ,S(2) , . . . ,S(n  )).  With  an  nxn  grid,  point  (i,j) 
is  unknown  number  (i-l)n  + j in  the  natural  ordering.  Given  T,  two  vec- 
tors forming  the  coordinates  (i,j)  of  each  unknown  are  computed  by  taking 
quotient  and  remainder: 

q(k)  = L(k-l)/nJ,  (1  £ k s n2) 
i(k)  = 1 + q(k),  (1  £ k £ n2) 
j(k)  = k - nq(k) , (1  £ k £ n2) . 

For  simplicity  let  n = 2m- 1.  Define 


P ( i » j) 


min(i,j,  2ra-i,  2m- j) , (1  £ i,  j £ n) . 


The  point  (i,j)  is  part  of  peripheral  number  p(i,j),  1 ^ p(i,j)  ^ m. 
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Peripheral  p has  8(m-p)  points  in  it,  except  for  peripheral  m which  con- 
tains the  single  point  (m,m). 

The  peripheral  ordering  is  per.  1,  per. 2,  . ..,  per.  m,  where  each 
peripheral  is  taken  in  clockwise  order:  top,  rhs , bottom,  lhs.  Define 

P-1 

f(i,j)  = 8 (m-k)  = 4(2m-p)  (p-1)  , 

‘ls-1 

the  number  of  elements  contained  in  peripherals  1,2 , . . . ,p(i, j)  - 1,  and 


g(i.j)  = f(i,j)  + 


(i-p)  + (j-p)  +1  if  i ^ j, 

8 (m-p)  - (i-p)  - (j-p)  +1  if  i > j 


= 4p(2m-p)  + 1 


-8m  + (if  i £ j) 
- (if  i > j) 


(2p  + i + j) 


The  vector  S (k)  = g(i,j)  defines  the  necessary  permutation,  and  can  be 
easily  evaluated  on  a parallel  or  pipeline  computer.  The  actual  reordering 
of  a vector  may  be  performed  on  the  CDC  STAR  by  using  the  built-in  vector 
permute  instructions,  though  these  should  be  used  sparingly  since  they  are 
quite  expensive.  Once  the  vector  S is  generated,  the  elements  of  A and  v 
may  be  generated  directly. 

Of  course,  the  implementation  of  an  iterative  method  based  on  an 
"unnatural"  ordering  involves  much  more  developmental  attention  than  is  pre- 
sented here.  We  only  mean  to  show  the  potential  for  parallel  computation. 

A more  complete  comparison  based  on  actual  machine  capabilities  has  not 
been  performed.  For  such  a discussion  of  some  other  methods,  see  Lambiotte 
[LI]. 
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8 .E.  Parallel  Gauss 

We  close  our  section  on  iterative  methods  for  block  tridiagonal  sys- 
tems with  some  brief  remarks  on  the  Parallel  Gauss  iteration  due  to  Traub 
[T2]  with  extensions  by  Heller,  Stevenson  and  Traub  [H4].  For  simplicity 
we  consider  the  normalized  form  A = (a^,  I,  c^) . 

In  the  block  LU  factorization  A = ( , I,  0)  (0,  d , cj) > let 
D = diag(d^, . . . ,d  ) . Since  D = I - L[A]D  Hl[A],  a natural  parallel  itera- 
tion is  = I - L[A](D^  ^ ) Hj[A],  where  is  some  initial  estimate 

of  D.  This  forms  the  first  of  three  iterations  making  up  the  Parallel 
Gauss  method;  the  other  two  parts  are  Jacobi  iterations  for  the  L and  U 
block  bidiagonal  systems.  It  is  shown  in  [H4]  that,  with  \(A)  ^ 1, 
u(A)  - (1  - vTrX)/(l  + , we  have  ||  D - D(0  ||  £ u||  D - D(l_1)  ||  . 

Techniques  for  reducing  the  constant  p,  are  given  in  [H4].  As  Stone  [S2] 
points  out,  this  linear  convergence  is  much  less  attractive  than  the  qua- 
dratic convergence  of  odd-even  reduction.  However,  in  certain  situations 
(parabolic  equations  or  the  multi- line  iterations  discussed  here)  it  is 
necessary  to  solve  a sequence  of  related  problems,  and  this  would  hope- 
fully provide  suitable  initial  estimates. 

Let  A'  = (a\,  cj^  an<*  suPPose  that  D'  has  already  been  computed. 

Since  d'.  - d.  = a,  d * c.  . - a'.(d'.  .)  ^c'  , , by  Theorem  3.4 
J J j j-1  j-1  J J-l  j“l 


d'.  - d . II  ^ (h  (• -j— 

j J"  v4  V1  + ,/1-V 


1 


2r-r)  + (r~)  ( 


l + .a-v 


= 1 - jCvO-X  + • ,1-V). 

Thus  ||  D ' - D ||  = max  j ||  dj  - d^  ||  5 1 - -j(>  fl-  X + , ^1-  X' ) , and  we  may  take  D' 
as  an  initial  approximation  to  D.  Now,  the  Parallel  Gauss  iteration  is 
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most  useful  when  X and  X'  are  not  too  close  to  1,  and  in  this  case  good 
initial  approximations  will  be  available  by  the  above  bound  on  J | D ' - D | j . 
When  X and  X'  are  close  to  1 the  bound  will  be  large,  but  in  this  case 
the  iteration  should  probably  not  be  used  due  to  slow  convergence. 

For  tridiagonal  matrices  we  have 


so  that 


Id1.  - d 
1 j ; 


di 

dj  - dj  ‘ ' Vl,dH  'j-! 

- (di-irl(ij  c>i  ■ ai  'j-P- 


1 - ./l-X 


- d. 


i + 777  1 j-i  j-i 


+ 


1 + - 1-  X' 


a c 


. 1 « .v.  . 1 ■ 

J J-l  J J-l1 


Let  A = max  . I a .c  . - a’.c'.  I < (X  + X1  )/4 , 6 . = Id'  - d | , 

J J J-l  J J-l1  J 1 j j ’ 

a = (1  - ..T7x)/(1  + ./l-X')  , b = 2/(1  + - '1-  X' ) . Then  *>  = 0,  5.  s a 5.  , + b, 

1 J J-l 

so  by  induction  5^  £ (1  + a + . . . + a^-2') b < b''  (1-a)  , or  2tJ  (» /l-X  + yl-X’ ) 


By  the  intermediate  value  theorem  (yl-  X + . ;1-  X’  )/2  = .4-X’  for  some  \ be- 


tween \ and  V , so  that  5^  ^ A/-/1-X  . Again  we  have  the  situation  where, 

-'<■ 

if  X is  close  to  1,  a large  upper  bound  results,  but  here  the  iteration 

should  not  be  used.  When  X is  not  so  close  to  1 the  bound  may  give  more 

complete  information  than  the  one  derived  for  block  tridiagonal  systems. 

Let  us  now  consider  the  utility  of  the  Parallel  Gauss  iteration  for 

the  parabolic  equation  ut  = (au^)^,  a = a(t,x)  > 0,  u = u(t,x),  with 

0 s x SI,  0 St,  and  u(0,x) , u(t,0) , u(t, 1)  specified.  Let  u1  c*  u(ik, jh) , 

2 

k = At,  h = Ax,  p = k/h  . The  Crank-Nicolson  method  for  advancing  to  the 
next  time  step  must  solve  the  tridiagonal  system 


/T  , P ni+l\  i+1  /T  P „i.  i 
(I  + - P ) u = (I  - I p )u  , 


P1 


(-  a 


j- 1/2*  °j-l/2  + aj+l/2  * * aj+l/2 


/o)  • 
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We  therefore  have 


a — / i+1  , , i+1  i+1  \ /o  i+1  i~\ 

A Daj-l/2’  1 + P(aj-l/2  + aj+l/2)  " Paj+l/2' 2); 

let  us  now  drop  the  superscript  i+1  for  convenience. 

A will  always  be  strictly  diagonally  dominant  because  a(t,x)  > 0, 

and  we  would  like  to  know  when  we  also  have  \(A)  ^ 1.  Using  the  mean  value 

theorem,  define  x.  and  x.  by 
J J 


aj-i/2  + aj+i/2  - "aj-l/2  " “ °y 
bj  = ax((i+l)k,  Xj) , 

a 3-2/2  + aj- 1/ 2 = 2a j- 1/2  ‘ h V 

f? . = a ( (i+1)  k,  x .) . 

J x J 

After  some  algebraic  manipulation  it  is  seen  that 

\(A)  = max.(l  + (l  + -rhb,))  ^ 

J paj- 1/  2 2 J 

x (1  + (1  - ■§  h S.))”1. 

paj-l/2  2 J 

In  order  to  guarantee  \(A)  s 1 it  is  sufficient  to  assume  th.  t h and  k satis- 
fy ~ h|ax(t,x)|  < 1.  If  m^  = max|ax(t,x)|  this  becomes  the  condition 
m^k  £ 2h. 

_2 

Suppose  we  want  the  more  stringent  condition  \(A)  £ (1  + c)  , c £ 0. 
This  may  be  satisfied  if  we  choose  h and  k such  that,  for  each  j, 


2a  . / + h b . , 


CPaj-l/2  S 1 + Dhbj  2> 
cpa  j 2 £ 1 - phH^ ' 2 . 


With  ra^  = max  a(t,x),  a sufficient  condition  is  cpm^  s 1 • Phm^  2,  or 
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(8.1)  k £ 2h2/(2cm0  + hn^)  . 

For  c = 0,  (8.1)  reduces  to  the  previous  condition  on  h and  k,  while  for 
c = h we  have  k £ 2h/  + m^)  = 0(h) . 

Now,  one  of  the  best  features  of  the  Crank-Nicolson  method  is  that 
k may  be  chosen  to  be  0(h) . We  have  seen  that  a small  value  of  c still 
allows  this  choice.  Conversely,  if  k = rh  satisfies  (8.1)  and  rm^  < 2, 
then  c £ h(l  - rm^/2)/ (rm^)  - 0(h).  When  we  try  to  pick  h and  k so  that 
the  Parallel  Gauss  method  will  be  effective  (c  » h)  we  are  necessarily 
restricted  to  small  time  steps.  When  h and  k are  chosen  in  the  usual  way 
(k  = 0(h)),  Parallel  Gauss  cannot  be  expected  to  be  very  effective  since 
\(A)  will  be  close  to  1.  Thus  the  method  appears  to  be  less  attractive 
for  this  particular  application  than  originally  believed,  and  odd-even 
reduction  would  be  preferred.  Once  several  reductions  have  been  performed, 
\ and  u.  may  be  sufficiently  small  to  warrant  use  of  Parallel  Gauss  and  its 


variations . 


9.  APPLICATIONS 


In  this  section  we  discuss  two  applications  of  practical  importance 
in  which  the  assumption  ||  B[A]I|  < 1 is  not  met;  in  the  second  case  it  is 
shown  how  to  get  around  this  problem. 


9 .A.  Curve  Fitting 

Cubic  spline  interpolation  of  the  data  (x^,f^),  1 £ i £ N + 1,  gives 
rise  to  the  tridiagonal  system  [C6,  p.  238] 


h.s.  + 2 (h . + h . . ) s . + h . .s,,. 
l i-l  v l L-l  1 1-1  1+1 


= 3(f[xi_1,xi]hi  + f[xi>xi+1]hi_1) . 

2 £ i £ N, 


h.  = x 
l i+l 


x.  . 

i 


We  have  A = (h^,  2(h^  + h^  ^)  , h^  and  |(  B[A]||  = — , which  is  quite  satis- 
factory. However,  more  general  problems  need  not  behave  so  nicely  even 
though  they  might  give  rise  to  block  tridiagonal  matrices. 

Cox  [C7]  presents  an  algorithm  for  least-squares  curve  fitting  with 
piecewise  polynomials  which  must  solve  the  system 


The  v vectors  contain  coefficients  of  the  Chebychev  polynomial  expansion 
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of  the  piecewise  polynomial,  and  the  X vectors  contain  Lagrange  multipli- 
ers derived  through  a constrained  linear  least-squares  problem.  The  C/s 
always  have  full  rank,  and  if  there  are  sufficiently  many  data  points 
between  adjacent  pairs  of  knots  then  the  H/s  will  all  be  positive  defi- 
nite. In  this  case  a s traightf orvard  band  elimination  may  be  used  well 
when  the  number  of  knots  and  data  points  is  large.  When  insufficient  data 
is  available  the  H s will  not  have  full  rank  and  pivoting  is  needed  in 
the  band  elimination.  It  seems  that  the  condition  j|  B[A]||<  1 will  not 
hold  for  any  reasonable  partitioning  or  data.  Indeed,  B[A]  is  not  even  . 
defined  for  the  most  obvious  partition. 


9 .B . Finite  Elements 

In  Section  2 .A  we  noted  that  matrices  constructed  from  finite  element 
approximations  to  differential  equations  often  do  not  satisfy  assumptions 
such  as  jj  B[A]||  < 1.  Thus  some  other  technique  must  be  used  to  establish 
numerical  stability  in  a direct  method  or  convergence  in  an  iterative 
method.  Usually  it  is  shown  that  A is  positive  definite.  The  following 
example,  admittedly  a simple  one,  shows  how  an  easily  implemented  change 
of  variables  can  yield  a new  system  A'x'  = v'  such  that  ||  B[A'  ]||  < 1. 

Even  if  the  change  of  variables  is  not  actually  made,  this  will  show  that 
II  B[A]  It*,  < 1 for  some  norm  ||  • ||^,  and  also  serves  to  establish  numerical 
stability. 

Let  us  begin  with  the  equation  -u"  = 1 on  the  unit  interval.  Using 
Hermite  cubics,  Strang  and  Fix  [S4,  p.  59]  derive  the  4th  order  accurate 


finite  element  equations 
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, u . , , - 2u  . + u . , . u'  - u' 

-4(-l±i r1 ti)  + h - J'h  - 1, 


(9.1) 


, u , , - u . , 

-I/_ J±i 111 

5 ' 2h 


2h 


> + I"j  - Hi ' 2ui  + "i-i* 


= 0 


Taking  the  natural  choice  x^  = (u ^ , u^)  , we  obtain  the  block  equation 


aXj_1  + bxj  + cXj+1  = (1,0) 


The  matrix  A = (a,  b,  c)  is  positive  definite,  but  ||  B[A]||  = — + 3 ' (4h)  . 
The  problem  is  simply  that  the  unknowns  are  out  of  scale. 

A second  set  of  unknowns,  which  are  in  scale,  is 

T h h T T 

x.  = (u.  - TT  u'.  , u.  +-  u'.)  aa  (u.  u.ll/o)  . Since  the  Hermite  cubi 

j j 2 j*  j 2 j J-12  j+1/2 

can  be  derived  from  B-splines  by  coalescing  adjacent  nodes,  these  unknowns 
are  actually  a more  natural  choice  than  the  original.  We  now  have 


res 


'1  -r 

,1  1, 


'1  0 
v0  h/ 


u . 
J 


and,  multiplying  the  first  equation  of  (9.1)  by  h and  the  second  by  h, 


2 T 

aXj  ^ + bXj  + cxj+^  = (h  • 1,  h • 0)  , 
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The  blocks  are  independent  of  h and  we  have  |!  B[A]|j  = 1,  with  o^(B[A])  < 1 
for  the  first  and  last  block  tows.  (Actually,  this  assumes  Dirichlet 
boundary  conditions ; Neumann  conditions  give  p^(B[A])  = 1.) 

For  the  more  general  equation  -u"  + Qu  = f,  Q a constant,  the  follow- 
ing equations  are  derived  [S4,  p.  59]: 


-L  (.36  u.^  - 3h  u'_L  + 72  u . - 36  uj+1  + 3h  uj+1) 

+ §0(54  Uj-1  + 13  h uj.!  + 312  u.  + 54  u.+1  - 13h  u^) 
= F j * 

io(3  uj-i  -huj-i + 8h  uj  • 3 Vi  -by 


+ io(-13  uj-l  - 3h  uj-l  + 8hu^  13  uj+1  - 3 h u'+1) 

= F'.. 

J 


Again  taking  = (u^  “ ^ u j > Uj  + ^ uj)T  anc*  suitably  scaling  the  equa- 
tions, they  become 


ax . . + bx . + cx . , = (hF.,F'.) 
J-l  J J+l  J j 


2 

where  now  the  blocks  are  0(h)  perturbations  of  the  blocks  in  (9.2).  Omit- 

5 2 U 

ting  details,  for  h sufficiently  small  this  gives  ||  B[A]||  = 1 - — Qh  + 0(h  ). 
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Since  rescaling  and  change  of  variables  in  the  linear  system  is  equiv- 
alent to  a change  of  basis  in  the  original  function  subspace,  the  process 
should  be  applicable  to  more  general  differential  equations  solved  by 
finite  element  methods.  The  underlying  technique  is  to  prove  something 
for  spline  approximations,  and  show  that  the  property  carries  over  in  the 
collapse  of  nodal  points  yielding  the  Hermite  cubic  approximations. 

Strictly  in  terms  of  modifying  linear  systems  by  a block  diagonal 
scaling,  it  is  natural  to  ask  if  either  of  the  following  problems  are 
solvable  in  general: 

1.  Given  an  arbitrary  block  tridiagonal  matrix  A,  find  a block 
diagonal  matrix  E such  that  ||  B[AE]||  < 1. 

2.  In  1.,  if  ||  B[A]  ||  < 1,  find  E such  that  ||  B [ AE ] 1 1 < ||  B[A]||. 

These  problems  are  closely  related  to  the  optimal  diagonal  scaling  of 
matrices  to  minimize  condition  numbers,  and  as  such  are  outside  the  scope 
of  this  thesis. 
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10.  IMPLEMENTATION  OF  ALGORITHMS 

This  section  contains  some  general  remarks  on  the  choice  of  algorithm 
for  particular  applications,  and  a more  detailed  comparison  of  algorithms 
for  use  on  a vector  computer.  In  the  comparison  we  assume  uniform  nxn 
block  sizes,  later  specialized  to  n = 2.  All  algorithms  will  use  the  same 
data  layout,  assumed  to  be  core-contained,  and  the  execution  time  compari- 
son will  be  in  two  parts:  first  for  a very  simplified  and  idealized  model 
in  which  the  machine  vectors  are  defined  as  storage  locations  in  arithmetic 
progression,  and  second  for  a model  in  which  the  machine  vectors  are  defined 
as  contiguous  storage  locations  (arithmetic  progression  with  increment  = 1) . 
The  first  model  uses  information  for  the  CDC  STAR- 100  ignoring,  among  other 
things,  the  fact  that  machine  vectors  on  STAR  must  satisfy  the  conditions 
for  the  second  model.  This  double  comparison  allows  us  to  pinooint  the 
effect  of  data  manipulation  required  by  the  stringent  requirement  on  vec- 
tor operands.  For  a description  of  STAR  facilities  necessary  for  this  dis- 
cussion, see  Section  2.B  and  Appendix  A.  Implementation  of  algorithms  for 
tridiagonal  linear  systems,  n = 1,  is  adequately  discussed  elsewhere  ([H4], 
[Jl],  [L2],  [Ml],  [S2])  and  these  references  may  be  considered  as  back- 
ground material  for  this  section. 

10. A.  Choice  of  Algorithm  (Generalities) 

The  salient  point  of  our  analysis  has  been  that  a number  of  algorithms 
may  be  applied  to  block  tridiagonal  linear  systems  that  are  likely  to  arise 
in  practice.  Sometimes  it  may  be  possible  to  take  advantage  of  special 
features  in  order  to  reduce  computing  costs  or  estimate  error.  When  the 
computing  costs  are  high  (such  as  one  enormous  problem  or  a large  problem 
solved  repeatedly)  an  extra  effort  must  be  made  to  choose  the  most 
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efficient  algorithm.  This  choice  will  be  difficult  to  make  in  general, 
but  we  feel  that  it  is  not  at  all  difficult  to  choose  good  algorithms 
in  many  cases. 

Regardless  of  the  computer  involved,  algorithms  for  three  special 
problems  should  be  implemented  before  proceeding  to  algorithms  for  more 
general  problems.  The  first  should  deal  with  general  block  tridiagonal 
matrices  restricted  to  blocks  of  a small  uniform  size,  say  no  more  than 
8x8.  On  a standard  sequential  computer,  band  or  profile  methods  are  ap- 
propriate; on  a k-parallel  computer  some  form  of  block  LU  factorization 
should  be  used,  and  on  a vector  computer  a combination  of  odd-even  reduc- 
tion and  LU  factorization  should  be  used.  Storage  allocation  and  access, 
vector  operations  on  a parallel  or  vector  computer  and  implementation 
issues  in  general  can  be  greatly  simplified  under  the  block  size  restric- 
tion. Options  include  implementation  as  a set  of  subroutines  for  specific 
block  sizes,  insertion  of  tests  for  use  as  a semidirect  method,  and  spe- 
cial scalings  or  pivoting  strategies  for  efficiency  or  stability.  This 
restricted  class  of  algorithms  may  be  applied  directly  to  small-block 
problems  or  used  as  a module  in  the  iterative  solution  of  large  sparse 
problems  as  indicated  in  Section  8.  The  experience  gained  may  or  may  not 
be  useful  for  implementation  of  methods  for  moderate-sized  blocks,  which, 
in  our  opinion,  should  be  treated  in  a different  manner. 

The  best  choice  of  a parallel  algorithm  for  moderate- sized  blocks  is 
not  clear  at  this  time.  These  cases  represent  a cross-over  between  effi- 
cient vector  implementations  of  band/profile  methods  and  the  odd-even 
methods.  Either  one  is  likely  to  be  satisfactory.  In  Lambiotte's  opinion 
[LI]  the  band  methods  are  better  for  the  CDC  STAR,  but  this  choice  is 


heavily  influenced  by  vector  addressing  restrictions  on  the  STAR  (cf.  our 
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remarks  in  Section  10. B) . Other  parallel  or  vector  computers  may  not 
present  such  problems,  and  a more  complete  analysis  is  needed. 

The  third  basic  class  of  algorithms  are  the  fast  Poisson  solvers, 
including  generalizations.  Recent  studies  have  shown  that  many  finite  dif- 
ference approximations  to  elliptic  equations  can  be  computed  very  effi- 
ciently by  iterative  or  modification  methods  based  on  the  Poisson  solvers. 
There  are  several  kinds  of  algorithms  to  choose  from,  and  all  have  consid- 
erable inherent  parallelism.  Other  algorithms  are  available  for  elliptic 
equations,  but  flexibility  and  ease  of  implementation  suggest  that  Poisson- 
based  algorithms  should  be  among  the  first  to  be  considered.  Brandt's 
multi-level  adaptive  methods  [B131  must  also  be  seriously  considered. 

Finally,  we  note  that  matrix  partitioning  and  block  elimination  is  a 
useful  technique  for  dealing  with  matrices  so  large  that  secondary  storage 
must  be  used  [R4].  Known  as  the  hypermatrix  scheme  in  other  contexts,  it 
has  been  extensively  exploited  by  the  A SKA  system,  a structural  analysis 
package  developed  at  the  University  of  Stuttgart  [F3].  Noor  and  Voigt  [Ml] 
discuss  implementation  of  hypermatrix  schemes  on  the  CX  STAR- 100.  The 
method  works  as  follows:  suppose  a partition  tt  is  given  (see  Section  2. A), 
and  let  A^  be  the  matrix  A partitioned  accordingly.  Block  sizes  in  ASKA 
are  chosen  to  optimize  transfers  between  storage  levels,  while  the  choice 
on  STAR  must  also  try  to  maximize  vector  lengths  and  hence  minimize  start- 
ud  costs.  The  usual  situation  is  that  a few  blocks  of  A will  be  in  main 

TT 

memory,  while  the  bulk  of  A^  will  be  in  secondary  memory.  For  example,  if 
tt'  is  another  partition  such  that  rr  refines  rr'  , we  would  perform  block 
elimination  on  A^, x^,  = v , . When  A^  is  block,  tridiagonal  so  is  A^,  . 

Since  the  data  requirements  for  the  LU  or  Cholesky  factorization  are  then 
highly  localized,  this  is  a natural  procedure  [E3].  Moreover,  the  diagonal 


113 


blocks  of  A^,  will  themselves  be  block  tridiagonal,  though  possibly  of  low 
order.  As  noted  in  Sections  2 .A  and  3. A. 2,  if  ||  BfA^]  ||  < 1 then  ||  B[A_r,]|| 
£ ||  jfA^]!!,  and  pivoting  will  be  restricted  to  those  blocks  of  A , resi- 
dent in  main  memory.  This,  of  course,  is  precisely  what  is  desired. 

10. B.  Storage  Requirements 

Data  layout  is  an  essential  part  of  planning  an  algorithm' s imple- 
mentation on  any  parallel  computer,  since  computational  speed  can  be 
severely  decreased  if  data  cannot  be  accessed  conveniently.  On  a vector 
computer  this  means  that  conceptual  vectors  must  conform  to  the  machine's 
definition  of  a vector  operand.  We  consider  some  consequences  of  this 
requirement  for  implementation  on  the  CDC  STAR- 100,  assuming  that  the  prob- 
lem is  core-contained. 

Storage  available  on  the  STAR,  for  64-bit  words,  consists  of  256  reg- 
isters and  either  512K  or  1M  words  of  main  memory.  A sizable  secondary 
disc  storage  is  also  available  but  will  not  be  considered  here.  A vector 
on  STAR  consists  of  s consecutive  storage  locations,  1 £ s £ 64K. 

Our  data  arrangement  for  block  tridiagonal  linear  systems  with  uniform 
nxn  blocks  is  illustrated  in  fig.  10.1  for  the  case  n = 2,  N = 7.  Define 
m =*  rlog7N"  , N*  = 2m,  and  if  N < N*  then  extend  Ax  = v with  N*  - N trivial 

equations  x^  =*  0.  The  matrix  A and  vector  v can  then  be  held  in  three 

2 

storage  vectors  of  length  n N*  and  one  of  length  nN*.  Taking  the  subdiag- 

2 

onal  blocks  as  an  example,  the  n components  of  a,  are  a. . , 1 £ i s n, 

k 1 j , k 

lsj5n,  1 s k £ N*,  and  the  storage  vector  a is  defined  by 

a((i  - l)N*n  + (j  - 1)N*  + k)  =*  a , . Thus  a total  of  (3n^  + n)N*  loca- 

l j , K 

tions  are  needed  to  store  the  original  system.  We  do  not  consider  the 
cost  of  setting  up  the  storage  vectors  since  each  of  our  algorithms  will 
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Fig.  10.1.  Data  layout  for  n=2,  N = 7 , N*  = 8 . 
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use  the  same  arrangement  and  since  the  details  can  very  widely  between 
systems.  This  particular  data  layout  has  been  chosen  because  it  makes  the 
odd-even  reduction  algorithm  more  efficient  for  the  true  STAR  model,  while 
the  other  competitive  algorithms  (LU  factorization,  odd-even  elimination) 
are  mostly  unaffected  by  the  choice  of  data  layout,  and  can  do  without 
extra  storage  for  trivial  equations. 

Since  the  STAR  forces  vector  lengths  to  be  £ 64K,  we  must  have 

? 

n“N*  ^ 64K  = 65,536.  For  a given  value  of  n,  this  restricts  the  system 
size  as  follows: 


n 

2 

3 j 

1 4 

| 5 

1 6 

| 7 

8 

64R/n2 

16384 

7281 

4096 

| 2621 

1826 

1337 

1024 

max  N* 

16334 

4096 

1 

4096 

2048  1 

1 1 

1024 

1 

1024 

1024 

These  will  probably  not  be  serious  restrictions  in  most  cases. 

A more  important  restriction  on  system  size  is  our  assumption  that  the 
problem  and  solution  will  remain  core-contained.  Assuming  a 512K  main 
store,  and  supposing  that  temporary  storage  plus  the  program  amounts  to T 
times  the  original  storage,  we  obtain  the  following  restrictions  on  N: 


n 

2 

3 

4 

5 

6 

7 

| 8 

512K  T-  1 

18724 

8738 

5041 

3276 

2279 

1702 

1310 

(T  4-  1)  (3n  + n)  T = 2 

12483 

5825 

3360 

2134 

1533 

1134 

873 

T = 3 

9362 

4369 

2520 

1638 

1149 

851 

655 

max  N*  T ■ 1 

16384 

8192 

4 096 

2048 

2048 

1024 

1024 

T = 2 

8192 

4096 

2048 

2048 

1024 

1024 

512 

T = 3 

3192 

4096 

2048  I 
. 

1024 

1024 

512 

512 

116 


1 


The  amount  of  temporary  storage  used  depends  on  the  algorithm  and  the 
extent  to  which  A and  v are  overwritten.  It  can  range  from  practically  no 

9 

temporary  storage  (other  than  0(n“)  registers)  for  the  LU  factorization  to 
T sa  m for  an  inefficient  implementation  of  odd-even  elimination.  Odd-even 
reduction  would  typically  have  1^2. 

We  now  consider  the  costs  of  data  manipulation  for  the  basic  algorithms. 
Major  considerations  are  the  speed  of  the  load  store  unit  on  STAR,  for  reg- 
ister-memory transfer  of  scalars,  and  the  requirement  that  vectors  be  stored 
as  contiguous  memory  locations. 

The  LU  factorization,  executed  mostly  in  scalar  mode:  The  only  data  manip- 
ulation is  to  load  store  between  memory  and  the  register  file.  To  a con- 
siderable extent  this  may  be  overlapped  with  arithmetic  operations.  A 
complete  analysis  requires  cycle  counting  in  an  assembly  language  program 
and  timings  from  actual  runs.  Our  estimate  of  execution  time  (based  on  an 
untested  cycle  count)  is  that  the  algorithm  will  probably  be  I/O  bound. 

With  this  in  mind,  we  suggest  four  possible  implementations:  (Times  given 
are  for  n = 2,  and  represent  minimal  run  times  assuming  complete  overlap 
between  scalar  arithmetic  and  load  stores.  Actual  run  time  will  certainly 
be  greater.  For  details,  see  Appendices  A,  B.) 

Version  1,  a one-time  solution,  all  scalar  mode 

time  a 50SN  - 362 

Version  2,  factor  + solve,  partial  vector  mode 

factor,  version  a,  time  a 324N  + 3421  (use  for  N > ~ 38) 
version  b,  time  a 374N  + 1769 
version  c,  time  a 420N  - 230  (use  for  N < — 33) 
solve,  time  a 303N  + 1039 
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1 


The  three  versions  of  factorization  use  increasingly  fewer  vector  opera- 
tions and  increasingly  more  scalar  operations.  If  more  than  one  system 
must  be  solved  with  the  same  matrix  A,  it  is  best  to  use  Version  2.  Our 
execution  time  comparison  in  Section  10. C will  use  Version  1. 

Odd- even  elimination,  executed  in  vector  mode:  An  important  feature  of 
this  algorithm  and  any  data  layout  like  ours  based  on  component  vectors 
is  that  the  conceptual  vectors  all  correspond  to  the  strictest  require- 
ments for  machine  vectors,  and  hence  no  additional  data  manipulation  is 
needed.  (cf.  [Ml]  for  the  tridiagonal  case.) 

Odd- even  reduction,  executed  in  vector  mode:  If  the  vector  computer  model 
allows  vectors  to  be  storage  locations  in  general  arithmetic  progression, 
then  no  additional  data  manipulation  is  needed.  In  a true  model  of  the 
STAR,  however,  we  need  to  perform  two  basic  operations:  (1)  separate  a 
vector  into  its  even  and  odd  components,  and  (2)  rejoin  the  even  and  odd 
parts  of  a separated  vector.  The  first  is  accomplished  with  two  STAR 
compress  instructions,  the  second  with  a STAR  merge  instruction.  The  odd- 
even  separation  is  needed  before  each  reduction  step  can  be  performed,  and 

must  be  applied  to  vectors  a,  b,  £,  and  v.  We  illustrate  the  separation 

2 

for  b in  fig.  10.2.  Having  set  up  a control  vector  z consisting  of  n N*' 2 
copies  of  the  bits  10  (a  single  vector  operation  on  STAR) , we  execute 

compress  b ->  bl(l)  per  z 

b(2j  - 1)  - bl(j) , (lSjS  |n2N*) 

12 

compress  b -*  M(— n N*)  per  z 

b(2j)  - bl(jn2N*  + j),  (1  * j s jn2U*). 

2 112 

The  cost  for  each  compress  is  n N*  4-  -t-(— n N*)  + 92  and  the  total  cost  to 

o 2 
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Fig.  10.2.  Odd-even  separation  for  n=2,N=7,  N*=8. 
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17  2 

odd-even  separate  a,  b,  c,  and  v is  — (3n  + n)N*  + 736.  Our  initial 

o 

extension  of  A to  N*  block  equations  makes  it  possible  to  perform  this 

? 

step  using  only  3 vector  compress  operations,  independent  of  n“ . If  A 

2 

had  not  been  extended  we  would  need  either  2 (3n  + n)  vector  operations 

2 

to  compress  the  vectors  one  component  subvector  at  a time,  or  n vector 

2 

operations  to  generate  a modified  control  vector  consisting  of  n copies 
of  z(l:N),  followed  by  the  original  3 vector  compress  operations. 

Lambiotte  [LI]  rejected  use  of  odd-even  reduction  for  n > 1 on  this  basis. 

The  net  effect  of  our  extended  data  layout  is  to  reduce  the  number 
of  vector  operations  needed  to  manipulate  vectors  for  the  STAR,  and  thus 
to  reduce  vector  startup  costs.  On  the  other  hand,  we  have  increased  vec- 
tor lengths  and  hence  storage  from  N to  N*,  and  this  may  be  a more  impor- 
tant consideration  in  some  cases.  The  modified  control  vector  approach 

would  then  be  most  natural  despite  the  increased  startup  costs.  In  any 
3 

case,  0(n  ) vector  operations  are  needed  for  the  arithmetic  part  of  the 
reduction  step. 

So  far  we  have  considered  only  the  odd-even  separation  to  prepare 

for  the  first  reduction.  In  general,  the  cost  of  the  separation  to  pre- 

17  2 

pare  for  the  ith  reduction,  1 £ i £ m - 1,  is  — (3n  + n)N*  + 736, 

o L 

ITt  = 2m+^  and  the  total  cost  for  m- 1 reductions  is  -^-(3n“  + n)  (N*  - 2) 

+ 736(m-l).  For  the  case  n = 2 this  yields  59. 5N*  + 736m  - 855.  We  will 
see  shortly  that  this  is  roughly  half  as  much  time  as  is  spent  in  the 
arithmetic  operations,  so  the  overhead  of  dealing  with  a restrictive 
machine  vector  definition  can  be  considerable. 

Data  manipulation  for  the  back  substitution  is  simplified  by  saving 
the  odd- indexed  equations  of  A^  in  appropriate  form  for  vector  operations 
(see  fig.  10.2  for  the  final  state  of  b) . Having  computed  x^i+^  = the 
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even  components  of  x^^ , we  compute  the  odd  components  of  x^^  and  merge 
the  two  sets  into  x^  . The  merge  operation  on  STAR  is  comparatively 
more  expensive  than  the  compress  operation,  but  we  are  only  dealing  with 
the  solution  vector  and  not  blocks  of  A.  The  cost  of  the  merge  operation 
at  stage  i is  3n.\\*  + 123,  and  the  total  cost  for  m- 1 reductions  is 
6n(N'v  - 2)  + 123(m  - 1) . For  n = 2 this  yields  12N'V  + 123m  - 147. 


p-fold  reduction:  Suitable  data  arrangement  and  use  of  control  vectors 
for  p-fold  reduction  can  readily  be  generalized  from  our  discussion  of 
odd-even  reduction.  The  odd- even  separation  generalizes  to  separation 
into  p vectors,  which  requires  p compress  instructions  per  storage  vector; 

^ 2 jjyL  1 » j 

the  cost  per  reduction  would  be  (p  + — ) (3n“  + n)N.*  + 363p,  N.*  - p‘  , 

O 1 1 

m = log  N 1 . The  back  substitution  requires  merging  d vectors  into  1, 

1 1 

using  p-1  merge  instructions  at  cost  of  3 - — lnN.*  + 123  (p- 1)  to  go 

, (i+1)  (L)  _ . , , , ..... 

trom  x to  x . It  is  clear  that  these  costs  are  minimized  bv 


Jacobi  iterations:  Data  manipulation  is  unnecessary  when  A and  v or  BrA' 
and  D[A]  ^v  are  stored  according  to  our  component  subvector  scheme. 


10. C.  Comparative  Timing  Analysis. 

We  close  with  an  execution  time  comparison  of  algorithms  for  general 
block  tridiagonal  systems  with  2x2  blocks,  based  on  timing  information  for 
the  CDC  STAR- 100.  This  will  expand  the  crude  comparison  of  the  LU  factor- 
ization and  odd-even  reduction  given  in  Section  4.D,  and  illustrates  many 
earlier  statements  concerning  the  cost  of  various  methods.  Essential  con- 
siderations include  vector  startup  costs,  creating  a substantial  penalty 
for  operations  on  short  vectors,  and  data  manipulation  to  deal  with 
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restrictive  definitions  of  vector  operands.  Section  10. B began  our  dis- 
cussion of  the  latter  problem. 

Our  initial  simplified  analysis  is  based  on  the  following  conven- 
tions, which  allow  us  to  consider  only  arithmetic  costs: 

1.  we  assume 

a.  vector  operands  may  be  any  sequence  of  storage  locations 
in  arithmetic  progression, 

b.  pivoting  is  never  necessary; 

2 . we  ignore 

a.  instruction  overlap  capabilities, 

b.  address  calculations,  loop  control  and  termination  tests, 

c.  load/store  costs  for  scalar  operands, 

d.  storage  management  costs,  as  a consequence  of  assumption  l.a. 

Our  second,  and  more  realistic  analysis  will  demand  that  vector  operands 
consist  of  consecutive  storage  locations,  so  we  must  consider  storage  manage- 
ment as  explained  in  Section  10. B.  Overlap,  loop  control  and  scalar  load/ 
store  will  also  be  considered,  but  not  address  calculation  or  termination 
of  iterations.  Algorithms  and  timing  derivations  are  given  in  Appendix  B; 
here  we  present  the  main  results  and  make  comparisons. 

Execution  times  for  the  basic  direct  methods  are  initially  estimated 
as  follows  (m  * ’’log2N?): 

1.  the  block  LU  factorization,  using  scalar  operations  only  (LU) : 

1012N  - 7 84 

2.  odd-even  elimination,  using  vector  operations  only  (OEE) : 

94 . 5Nm  + 12999m  - 93. 5N  - 1901 


r 
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3.  odd-even  reduction,  using  vector  operations  except  for  the  final 

block  system  (OER) : 106. 5N  + 14839m  - 14717.5 

4.  p-fold  reduction,  p 2 3: 

d-  1 


145N  + 19206m  (■ 


log2p 


) - ( 1935 lp  - 19579) 


We  can  first  dispose  of  p-fold  reduction  as  a competitive  method  for  gen- 
eral problems  since  either  LU  or  OER  is  always  cheaper.  Execution  time 
ratios  show  the  benefits  of  using  OER  in  preference  to  LU  and  OEE. 


N = 1023 

N = 8191 

limit,  N — 00 

OER: 

LU 

.23 

.13 

.11 

OER: 

OEE 

.24 

.11 

0. 

The  importance  of  vector  startups  is  seen  by  comparing  OER  and  LU, 
for  the  latter  is  faster  when  N < 100.  OER  works  by  successively  reducing 
vector  lengths,  and  at  some  point  this  becomes  inefficient.  A polyalgorithm 
combining  the  two  methods  is  straightforward:  if  m £ 6 then  use  LU,  other- 
wise do  m-5  reductions,  solve  the  remaining  system  by  LU,  and  back  substi- 
tute. This  reduces  the  time  for  OER  to  106.5  N + 14839m  - 46903.5,  and 
for  N 3 1023  it  represents  a savings  of  13  i. 

We  also  note  that  vector  startups  are  the  dominant  term  for  OER  when 
N £ 1532,  and  are  the  dominant  term  for  OEE  when  N ^ 137.  Of  course,  OEE 
uses  0(N  log  N)  operations  vs.  O(N)  for  the  other  methods  and  soon  becomes 
inefficient.  Nevertheless,  it  is  faster  than  LU  for  455  £ N ^ 1023, 
although  it  is  always  slower  than  OER. 

Our  conclusion  is  that  the  OER  - LU  polyalgorithm  is  the  best  direct 
method  of  those  considered  here. 


Now  consider  the  cost  of  individual  eliminations  or  reductions  with 
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respect  to  the  total  cost  of  the  algorithm.  For  odd-even  elimination, 

each  elimination  step  has  roughly  equal  cost,  the  last  being  slightly 

cheaper.  Thus  each  step  costs  about  Ira  of  the  total  cost.  For  odd-even 

reduction,  the  cost  per  reduction  decreases  as  the  algorithm  proceeds: 
cost  of  the  kth  reduction  as  a fraction  of  total  cost 


1 k ° 1 

2 

3 

4 

5 

6 

total  cost 

N = 1023,  m = 10 

00 

C^l 

17$ 

12$ 

3$ 

7$ 

7$ 

242,622 

N = 3191,  m 3 13 

43* 

22$ 

12$ 

m 

4$ 

3$ 

1,050,531 

(cost  for  kth  red.  = 106.5(2m-k)  + 14839  when  N = 2m  - 1) 


For  small  N,  when  startup  costs  are  dominant,  reduction  costs  are  approxi- 
mately equal,  while  for  large  N,  when  true  arithmetic  costs  are  dominant, 
the  kth  reduction  costs  about  12  of  the  whole.  This  has  immediate  con- 
sequences for  the  effectiveness  of  incomplete  reduction,  since  the  omitted 
reductions  are  only  a small  part  of  the  total  cost  when  N is  large. 

For  our  comparison  of  semidirect  and  iterative  methods,  the  goal  will 
be  to  approximate  x with  a relative  error  of  l/N,  or  to  reduce  the  initial 
error  by  a factor  of  l/N.  Thus  if  we  perform  either  k odd-even  elimina- 
tions or  k reductions,  we  want  ||  B[H^+^]j|  £ l/N  or  ||  B[A^k+^]||  £ I N, 
assuming  jj  B[A]j|<  1.  Theorem  7.1  is  then  applied  to  produce  the  desired 

approximation  by  incomplete  elimination  or  reduction.  Taking  9 = j (B [A ] 1 1, 

k k 

we  have  ||  B[H^+^]||,  ||  B[A^k+1^]j|  £ » and  demanding  3“  ^ l/N  yields 

k s log2((log2N)  / (- log2 9) ) . 

The  execution  time  for  k eliminations  is 

94.5Nk  + 12999k  + 10. 5N  - 105(2k)  + 1205, 


and  the  time  for  k reductions  is 
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106. 5N  + 14839k  - 96(2m'k)  + 1287. 

Typical  savings  for  incomplete  reduction  over  complete  reduction  would  be: 
3 = .9,  pick  k £ log2log2N  + 2.718 


N = 1023, 

k 

= 7, 

savings  ~ 12)5 

N = 8191, 

k 

= 7, 

savings  = 7^ 

2 

8 = , pick  k 

2 log2log2N  + .774 

N = 1023, 

k 

= 5, 

savings  = 25^ 

N = 3191, 

k 

= 5, 

savings  = 12  £ 

0 3 Pick  k 

log. 

log,N 

N = 1023, 

k 

= 4, 

savings  = 33-- 

N = 3191, 

k 

3 4, 

savings  = 16^ 

Comparison  to  the  OER  - LU  polyalgorithm  decreases  the  savings. 

We  next  consider  iterative  methods  and  their  combination  with  semi- 
direct  methods.  One  step  of  the  block  Jacobi  iteration  costs  22. 5N  + 3031, 
while  one  step  of  the  block  Jacobi  semi- iteration  costs  26. 5N  + 3408  plus 
overhead  to  estimate  parameters  such  as  p = o(B[A1).  Execution  time  may 
be  reduced  to  12N  + 1840  (16N  4-  2217)  per  iteration  by  directly  computing 
B[A]  and  D[A]  (cost  = 33. 5N  + 4367  = 3 iterations). 

We  will  use  the  Jacobi  iteration  in  its  simplest  form  when  only  a few 
steps  are  demanded,  and  Jacobi  semi- iteration  when  many  steps  are  needed. 
The  number  of  iterations  needed  by  J-SI  to  reduce  the  initial  error  by  a 
factor  of  1/N  are  estimated  to  be 
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The  LU  factorization  is  always  faster  than  OEE,  and  is  faster  than 
OER  for  N £ 312.  Vector  startup  costs  are  dominant  in  OEE  for  N £ 152, 
and  in  OER  for  N £ 798.  The  OER-LU  polyalgorithm  should  perform  m-6  reduc- 
tions, solve  the  remaining  system  by  LU,  and  back  substitute.  The  time  is 
183N  + 16233m  - 70677,  yielding  a 16$  saving  over  OER  for  N = 1023  but  only 
a 3$  saving  for  N = 3191. 

The  execution  time  for  odd-even  elimination  is  allocated  as  follows: 


! 


N = 1023  N = 3191 


arithmetic 

96$ 

96$ 

overhead 

4 $ 

4$ 

results 

37  $ 

98$ 

vector  startups 

13$ 

2$ 

Analysis  of  the  total  time  for  odd-even  reduction  shows  the  greater  impor- 
tance of  overhead  operations: 


I 


N = 1023 
arithmetic 

overhead 


N =*  8191 
arithmetic 

overhead 


results 

startups 

33$ 

40$ 

73$ 

23$ 

/ d/ 

27$ 

56$ 

results 

startups 

52$ 

11$ 

62$ 

37$ 

1$ 

38$ 

38$ 

12$ 
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limit,  N -*  =° 

results 

startups 

/arithmetic 

53  3 

- 

58% 

(overhead 

42* 

- 

423 

1003 

- 

The  increase  in  time  for  OER  due  to  overhead  is  37*  for  N = 1023,  6l3  for 
N = 8191,  and  723  in  the  limit  as  N -*  ®.  As  a consequence  of  the  data 
layout  of  Section  10. B,  overhead  startup  costs  are  quite  small,  but  the 
overhead  result  costs  cannot  be  ignored  and  probably  cannot  be  decreased. 
The  strong  requirement  that  machine  vectors  consist  of  contiguous  storage 
locations  leads  to  much  "wasted"  effort  in  execution  time. 

Most  of  the  execution  time  of  OER  is  actually  spent  in  the  arithmetic 
and  vector  compress  operations: 

cost  of  individual  operation  as  fraction  of  total  cost 


N = 1023 

N = 8191 

limit,  N -»  “ 

divide 

12* 

12* 

123 

multiply 

43* 

35* 

32* 

add / subtract 

18* 

15* 

14  3 

compress 

203 

29* 

33* 

merge 

43 

6$ 

73 

transmit 

3* 

3* 

3* 

The  compress  operation  assumes  more 

reflecting  its  low  startup  costs. 

importance  as 

N increases,  again 

As  noted  earlier, 

most  of  OER' 

s time  is  spent 

in  the  first  few  reduc 

tions  and  subsequent  back  substitutions. 
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p = .9,  1.5m  iterations, 

2 

p = ^,  .75m  iterations, 

p = -j,  .5m  iterations. 

2 

For  p = — , for  example,  J-SI  is  never  faster  than  OER. 

As  suggested  in  Section  7. A,  we  can  combine  incomplete  elimination  or 

reduction  with  Jacobi  iterations.  If  k steps  of  elimination  or  reduction 

are  used,  followed  by  l iterations  and  back  substitution,  k and  l should 
ok  l 

be  chosen  so  that  3 s L N,  or 

(*)  k + logo ^ 2 log2((log9N)  / (-log9  S) ) . 

The  time  for  the  OER-J  method  would  then  be 

106. 5N  + 14839k  - 96(2m-k)  + 1287  + 2(19(2m_k)  + 2615), 

2 

and  it  only  remains  to  minimize  the  time  subject  to  (*) . For  3 = ■j, 

N = 1023,  the  resulting  values  are  k = 2,  i = 5,  which  represent  a savings 
. 2 

of  37/:  over  OER.  For  3 = — , N = 8191,  the  resulting  values  are  k = 3, 
2=3,  which  represent  a savings  of  15$  over  OER. 

Of  course,  to  determine  k and  i we  need  to  know  3.  While  it  is  true 
that  3 = j|  B[A]j|  can  be  computed  directly  from  A at  a cost  of  46. 5N  + 3881, 
or  from  B[A]  at  a cost  of  15N  + 308,  we  recognize  that  half  the  rows  of 
B[A]  are  available  as  part  of  the  first  odd-even  reduction.  An  estimate 
of  ||  B[A]||  can  be  computed  from  these  rows  at  a cost  of  7.5N  + 308.  For 
large  N even  this  cost  considerably  reduces  any  benefits  gained  from  using 
incomplete  reduction,  and  points  out  the  importance  of  an  analytic  estimate 
of  ||  B[A]||  rather  than  a computational  estimate. 
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This  ends  our  initial  comparison  of  algorithms.  We  now  make  the 
assumption  that  vector  operands  are  consecutive  storage  locations,  allow 
overlap  of  scalar  operations,  and  consider  the  cost  of  loop  control  and 
scalar  load  store  operations.  The  net  effect  is  to  lower  the  time  esti- 
mate for  the  LU  factorization  and  raise  the  time  estimate  for  odd-even 
reduction.  Odd-even  elimination  remains  substantially  unchanged.  Rather 
than  simply  recompute  the  figures  presented  earlier,  which  will  be  done  in 
a few  cases,  we  try  to  indicate  where  the  algorithms  spend  most  of  their 
time. 

Revised  time  estimates  are  given  below.  These  should  give  a better 
indication  of  actual  runtime  on  STAR,  but  are  probably  all  underestimates; 
we  note  changes  from  the  first  set  of  timings,  m = i log^N"!  . 

1.  LU  factorization,  (LU)  600N  - 150 

(a  40 p decrease  due  to  use  of  instruction  overlap) 

2.  Odd-even  elimination,  (OEE)  98.5Nm  4-  13403m  - 101. 5N  - 2261 
(a  4/»  increase  due  to  overhead  costs) 

3.  Odd-even  reduction.  (OER)  183N  + 16233m  - 16108 

(a  10-702  increase  due  to  overhead  costs;  larger  N incurs 
larger  increases). 

Time  ratios  show  the  effect  of  decreasing  the  LU  time  and  increasing  the 
OER  time 


N = 1023 

N = 8191 

limit,  N — “ 

OER : LU 

.54 

.34 

.31 

OER:OEE 

.32 

.17 

0. 
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cost  of  ith  reduction  as  fraction  of  total  cost 


i = 

1 

2 

3 

4 

5 

6 

N = 1023 

33  4 

19# 

12  4 

34 

7 4 

6 4 

d =8191 

45# 

23# 

12  4 

6 i 

2$ 

limit,  N -*  ® 

5016 

25  4 

134 

n 

3 4 

2 i 

Of  the  total  time,  86#  is  spent  in  the  reduction  phase  and  14#  in  the  back 
substitution . 

One  of  the  effects  of  the  OER-LU  polyalgorithm  is  to  reduce  the  impor- 
tance of  vector  startup  costs,  as  seen  below. 


N = 1023 

= 3191 

scalar  time 

14# 

2 4 

vector  time 

864 

98  4 

results 

63* 

91# 

s tar tups 

23  4 

7 4 

startup  costs 

are  already 

small  not  much 

ilete  reduction  and  incomplete 

reduction  + 

results.  (9 

= II  B[A]  ||  ). 

Incomplete  reduction,  savings 

over  OER 

N = 

1023 

N = 3191 

k 

savings 

k 

savings 

= .9  7 

10# 

7 

5 4 

f 5 

i 4 

CM 

5 

9 4 

1 ‘ 

27# 

4 

134 
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Incomplete 

reduction 

+ Jacobi  iteration. 

saving  over  OER 

N = 

1023 

N = 

8191 

k 

J 

savings  k 

l 

sav in^s 

3 •=  .9 

3 

9 

22 1°  4 

6 

8 i 

2 

3 

2 

5 

36^  2 

6 

1315 

_1 

47  ^ 2 

9 

1 

5 

4 

As  seen  earlier,  the  seraidirect  methods  are  best  applied  when  the  omitted 
odd-even  reductions  make  up  a significant  part  of  the  OER  time.  There  is 
considerable  advantage  to  using  incomplete  reduction  plus  Jacobi  iterations 
roughly  doubling  the  savings  obtained  by  simple  incomplete  reduction. 


APPENDIX  A 


VECTOR  COMPUTER  INSTRUCTIONS 

This  section  describes  those  features  of  the  CDC  STAR- 100  needed  for 
our  comparison  of  algorithms  in  Section  10.  Information  given  here  is 
derived  from  CDC  reference  manuals  and  timing  measurements  made  at  Lawrence 
Livermore  Laboratory  [C2],  and  by  no  means  represents  a complete  discussion 
of  the  STAR'S  facilities. 

Execution  times  are  for  full-word  (64  bit)  floating  point  instructions, 
given  as  the  number  of  40  nanosecond  cycles  needed  for  each  instruction. 

All  times  should  be  accurate  to  within  5^  except  as  noted,  though  actual 
performance  may  be  degraded  somewhat  due  to  memory  conflicts. 

A vector  on  STAR  consists  of  N consecutive  storage  locations, 

1 £ N £ 65,536.  The  vector  length  restriction  is  usually  not  serious,  but 
the  need  for  consecutive  storage  is  crucial.  Other  vector  computers,  the 
TI  ASC  in  particular,  do  not  have  this  restriction.  Both  machines  are 
pipeline  computers  with  the  usual  scalar  instructions  as  well  as  vector 
instructions . 

Useful  features  of  vector  instructions  on  STAR  include: 

broadcast  constants  - Most  vector  instructions  allow  a scalar  (the  broad- 
cast constant,  held  in  a register)  to  be  transmitted  repeatedly  to  form 
one  or  both  of  the  vector  operands.  Use  of  this  facility  decreases  exe- 
cution time  in  a few  cases. 

control  vectors  - Bit  vectors  can  modify  the  effect  of  vector  instructions, 
such  as  suppressing  storage  to  selected  components  of  the  result  vector 
or  determining  which  operand  components  are  to  be  used  in  an  operation. 
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Use  in  vector  data  manipulation  (compress  and  merge)  is  explained 
later. 

offset  - Modification  to  the  base  address  of  operands. 

sign  control  - Certain  vector  operations  allow  the  sign  of  the  operands 
to  be  modified  before  execution. 

Individual  classes  of  instructions  follow. 

scalar  (register)  instructions.  Issue  time  is  the  number  of  cycles  required 
to  initiate  execution.  Shortstop  time  is  the  time  after  issue  when  the 
result  is  first  available,  but  not  yet  placed  in  a register.  The  result 
must  be  used  at  the  precise  time  given;  if  not,  the  result  cannot  be  used 
until  it  is  placed  in  a register.  Result-to-register  time  is  the  number 
of  cycles  after  issue  when  the  result  is  placed  in  a register.  Overlapped 


execution  is  possible,  but  the  exact 

effect 

on  execution 

time  is  bes 

determined  by 

actual  testing,  which  we  have 

not  done. 

STAR 

result 

instruction 

operation 

issue 

shortstop  to 

regis  ter 

62 

add,  normalized 

2 

6 

11 

66 

subtract,  normalized 

2 

6 

11 

6B 

multiply,  significant 

2 

10 

15 

6F 

divide,  significant 

3 

- 

43 

73 

square  root 

3 

- 

71 

78 

register  transmit 

2 

4 

9 

scalar  (register)  load/store  Instructions.  Up  to  three  load/store  instruc- 


tions may  be  stacked  in  the  load/store  unit  at  any  given  time.  The  stack 
is  processed  sequentially  at  a rate  of  19  cycles  per  load  and  16  cycles 
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per  store  minimum,  assuming  no  memory  conflicts.  Once  a memory  bank  has 
been  referenced  it  remains  busy  for  31  cycles;  during  this  time  further 
references  to  the  bank  are  delayed  and  the  conflict  will  affect  references 
to  other  memory  banks.  Start  to  finish  for  a single  load  requires  at 
least  31  cycles,  and  a minimum  of  66  cycles  are  required  to  store  an  item 
and  reload  it. 


STAR 

instruction 

operation 

issue 

result 
to  register 

load  store 
unit  busy 

7E 

load 

3 

28 

19 

7F 

s tore 

3 

16 

branch  instructions . Execution  time  for  a branch  can  range  from  8 to  34 
cycles,  depending  on  the  particular  instruction,  whether  a branch  is  actu- 
ally made,  whether  the  target  instruction  is  in  the  instruction  stack,  etc. 

We  will  be  only  slightly  pessimistic  if  we  take  40  cycles  to  be  the  time 
for  a typical  branch. 

vector  operation  instructions.  Execution  time  consists  of  an  initial  start- 
up time  followed  by  sequential  appearance  of  result  vector  components,  or 
startup  time  followed  by  processing  time  of  operands  if  a scalar  result  is 
generated.  Vector  instructions  may  not  be  overlapped  except  for  a negligible 
part  of  the  startup  time.  In  the  table  below,  N is  the  length  of  the  input 
vector(s)  and  V is  the  number  of  omitted  items  due  to  use  of  a 
tor.  Times  for  instructions  D8  - DC  may  vary  as  much  as  +20^. 


control  vec- 
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STAR 

vector 

ins  truction 

operation 

time 

effect 

82 

add,  normalized 

+ 71 

c . — a . + b . 

Ill 

86 

subtract,  normalized 

jN  + 71 

c . — a . - b . 

ill 

8B 

multiply,  significant 

N +159 

c . — a . x b . 

ill 

SF 

divide,  significant 

2N  +167 

c . «-  a . + b . 

ill 

93 

square  root 

2N  +155 

c.  - *jr. 

i i 

98 

transmit 

jN  + 91 

c . *-  a . 
l i 

99 

absolute  value 

|n  + 91 

c . «-  j a . ] 
l 1 l 1 

D8 

maximum 

6N  - 2V  + 95 

c «-  max  a , 

N 1 

DA 

summation 

6.5N  - 2V  + 122 

c - a . 

i-1  1 

DB 

product 

6N  - V + 113 

c - "N  a. 

i=l  i 

N 

DC 

inner  product 

6N  - V + 130 

c - a b 

i=  1 

vector  data 

manipulation  instructions. 

The  compress  instruction  transmi 

selected  components  of  one  vector  into  a shorter  result  vector.  Given  a 


vector  a and  a control  vector  z , the  instruction  essentiallv  executes  the 
I code 

j - 1 

for  i = 1,  2 , . . . , N 

if  zi  = 1 then  £c ^ j - j+l}. 

The  result  vector  c has  length  bitsum(z) . The  merge  instruction  "shuffles" 
[ two  vectors  into  a third,  according  to  the  control  vector: 
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length (c)  = length(z)  = length (a)  + length(b) 
j - l;  k - 1 


for  i = 1,  2 

if  z.  = 
i 


length (c) 

1 then  (V  - a ; j - j+l} 
else  {Ci  - b ; k - k+l} 


In  the  table  below,  = length (a) , ^ = lengthfc),  R - number  of  input 
items  broadcast  from  register. 


STAR 

instruction 

BC 

BD 


operation 

onpress  a ->  c per  z 
merge  a,b  - c per  z 


time 

M1  + &2  + 92 
3^  - 2R  + 123 


Execution  times  for  these  instructions  may  vary  as  much  as  +20 
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APPENDIX  B 

SUMMARY  OF  ALGORITHMS  AND  TIMINGS 


We  collect  the  algorithms  discussed  in  the  main  text,  and  present 
timing  formulae  in  general  terms  for  a vector  computer.  This  is  then 
specialized  to  block  tridiagonal  systems  with  2x2  blocks  using  timing 
information  for  the  CDC  STAR- 100.  A comparison  of  methods  based  on  the 
latter  set  of  timings  is  given  in  Section  10. C.  It  must  be  emphasized 
that  our  time  estimates  have  not  been  verified  by  direct  testing  on  STAR. 

The  algorithms  solve  Ax  = v,  A = (a . , b . , c.)  with  blocks  of  uni- 

j j j N 

form  size  n X n.  The  basic  arithmetic  operations  are:  (S  = add  or  sub- 
tract, M = multiply,  D = divide) 


1.  factor  an  n x n block  b 

12  13  o 

cost  = — (n  - n)D  + — (2n  - 3n  + n) (M  + S) 

2.  having  factored  b,  solve  bg  = f,  where  f is  n x n' 

2 

cost  * n'  (nD  + (n“  - n)  (M  4-  S)) 

3.  product  of  n x n and  n x n'  blocks 

2 2 

cost  = n' (n  M + (n  -n)S) 

4.  difference  of  two  n x n'  blocks 
cost  = n' (nS) 


Depending  on  context,  the  symbols  S,  M,  D can  mean  either  a single 
scalar  operation  or  a vector  operation  with  vector  length  specified  by 
the  algorithm. 

For  each  algorithm  we  give  a code  in  terms  of  blocks  of  A,  since  the 
translation  to  individual  scalar  or  vector  operations  is  generally  obvious. 
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In  those  cases  where  additional  detail  seems  to  be  necessary  it  is  pro- 
vided. Two  basic  sets  of  timings  are  given,  five  formulae  in  all  for 
each  algorithm: 

1.  counting  arithemtic  operations  only,  assuming  machine  vectors 
can  be  any  sequence  of  storage  locations  in  arithmetic  progres- 
sion, and  ignoring  pivoting,  instruction  overlap,  address  cal- 
culation, loop  control,  termination  tests  for  iterations,  load 
store  costs  for  scalar  operands,  and  any  other  forms  of  storage 
management 

(a)  for  systems  with  n X n blocks  in  general  terms  for  a 
vector  computer 

(b)  for  systems  with  n x n blocks,  with  timing  data  for 
the  CDC  STAR- 100 

(c)  for  systems  with  2x2  blocks,  with  STAR  data 

2.  a more  realistic  model  of  STAR,  requiring  machine  vectors  to  be 
contiguous  storage  locations,  allowing  overlap  of  scalar  opera- 
tions, counting  storage  management  and  load'store  costs,  and 
loop  control  to  some  extent,  but  still  ignoring  Divoting,  ad- 
dress calculations  and  termination  tests  for  iterations 


(a) 

for 

systems 

with  n 

X n blocks. 

with 

STAR 

data 

(b) 

for 

sys  terns 

with  2 

X 2 blocks. 

with 

STAR 

data. 

Additional  notes  are  provided  in  many  cases.  Timings  1(c)  and  2(b)  are 
used  in  Section  10. C. 

A few  remarks  on  the  timings  are  necessary.  Since  we  eventually  want 
estimates  for  2x2  blocks,  the  only  loop  control  that  will  be  counted  is 
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that  on.  the  main  loops.  Operations  on  n x n matrices  or  n- vectors  are 
assumed  to  be  expanded  into  straightline  code. 

The  address  calculations  that  have  been  ignored  here  would  be  exe- 
cuted in  scalar  mode,  and  probably  can  be  completely  overlapped  with 
scalar  arithmetic  operations  in  the  LU  factorization.  Their  effect  on 
the  odd-even  methods  would  be  to  increase  the  0(ml  terms,  m = "log^N-1  , but 
this  probably  represents  only  a small  relative  increase  since  the  0(m) 
terms  consist  of  vector  startup  costs  and  are  already  large.  While  scaLar 
operations  can  sometimes  destroy  the  efficiency  of  a vector  code,  we  be- 
lieve that  our  basic  conclusions  will  not  be  altered  by  including  address 
calculations . 
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1.  Block  LU  factorization,  using  scalar  operations  mostly 
Algorithm: 

di  * bi  ; £i  * vi 
for  i = 2 , ...  , N 

solve  d.  (u.  g.  ) = (c.  f.  ) 
l-l  l-l  i-l  l-l  l-l 

d . = b . - a . u . . 
l i ii-l 

fi  = vi  - ai  §1-1 
solve  dN  xN  = fN 

for  i = N-l,  ...  , 1 


X.  = g.  - U.  X. 

1 1 11+1 


Storage  handling: 

If  desired  we  can  overwrite  d.  -»  b . , L.  = a.  d.V  — a u -»  c , 

1 l i ii-l  l i l 

x.  -»  g.  - f.  -*  v..  The  factorization  loop  needs  to  save  either 
1111 

2 

f ^ or  and  g^,  a total  of  n + n stores  and  subsequent  loads, 

d.,  f.  are  saved  then  the  second  loop  would  be 

i i r 


for  i = N- 1,  ...  , 1 

solve  d.x.=f.-c.x.  , 
ii  i i i+l 

with  a corresponding  increase  in  execution  time. 

Variations  of  the  basic  algorithm,  with  estimate  of  minimum  runt 
due  to  load/store  unit: 


Version  1,  a one-time  solution. 

initialization  loads  b^,  v^ 

factorization  loop  loads  c.  ,,  a.,  b.,  v. 

i-l  i l i 

stores  g^  l 

initialization  stores  x^ 


d . and 

i 

If 


ime 
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back  substitution  loop  loads  u.,  g. 

11 

stores  x. 

1 

At  19  cycles  per  load  and  16  per  store, 

2 2 

total  time  £ {l9(4n“  4-  2n)  + 16(n“  4-  2n) } (N- 1) 
4-  19  (n2  4-  n)  4-  16  (n-) 

= (92n“  4-  70n)(N-l)  4-  16n~  4-  35n 
for  n = 2,  total  time  5 503N  - 362 


Version  2,  factor  + solve,  A = ( ^ , I,  0)(0,  d^  , 0 ) (0  , I,  u.) 

o 

Factorization  requires  n“(3N  - 2)  loads  for  A. 

Version  a.  factorization  looo  stores  d.  onlv,  and  comoutes 

i 

l.  - a.  d.\,  u,  = d . ^ c.  by  vector  operations,  vector  length  = 
JJJ-ijJJ 

total  time  £ 19n“(3N  - 2)  + 16n“N  + -r^n"  - n)  D + -r(14n2  - 15n” 

Z 6 

D = t^N  + etc. 

for  n = 2,  total  time  S 323. 5N  4-  3421 
Version  b.  factorization  loop  stores  d^,  u^,  computes  l.  by  vector 
operations,  vector  length  = N. 
total  time  £ 19n2 (3N  - 2)  4-  16n2(2N  - 1) 

+ \(2n~  - n)D  4-  ~(8n2  - 9n~  4-  n)  (M4-S) 
for  n = 2,  total  time  5 373. 5N  + 1769 
Version  c.  factorization  loop  stores  d.,  u.,  1.. 

ill 

total  time  £ 19n2(3N  - 2)  + 16n2(3N  - 2) 
for  n =*  2,  total  time  2 420N  - 280 
Solution  of  Ax  = v,  given  the  factorization  of  A. 


forward  elimination  loads  v^  initially 
loop  loads  l.,  v.,  stores  f. 

i i i 

solve  dj  gj  = f j with  vector  operations. 


N. 

4-  n) (H4S  ) , 
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back  substitution  loads  initially 

loop  loads  u.,  g.,  stores  x. 

i i r 

2 

total  time  s 19(2n  (N-l)  4-  2nN)  + 16(2nN) 

+ -j(n^  + n)  D + -jr(2n3  + 3n~  - 5n)  (M+-S ) 
for  n = 2,  total  time  2 302. 5N  + 1039 


Timings : 

1.  Version  1,  for  initial  comparison,  arithmetic  operations  without 
overlap,  load  Stores  ignored. 

(a)  t^(3n'  + n)tD  + ^(14n3  + 9n~  - 5n)  (tM  + tg ) In 

- fn2tD  + (2n3  + n2) (t^  + tg) } 

(b)  with  STAR  data:  (70n3  + 114n~  - 2n)N  - (60n3  4-  76n“) 

(c)  for  n = 2,  with  STAR  data:  1012N  - 784 


Version  1,  for  second  comparison,  allowing  overlap  and  including 
load/ stores . 

In  the  absence  of  facilities  for  direct  testing  on  STAR,  a scalar 
code  was  written  for  the  special  case  n = 2 and  time  estimates 
computed  from  that  code.  The  details  are  tedious  and  wel 1- illustrated 
by  the  simpler  tridiagonal  code  in  [L2],  so  they  will  not  be  given 
here.  The  main  time  estimates  are 

initialize  - loads  + loop  set  up 
factorization  loop 
Initialize 

back  substitution  loop 

total  time  estimated  to  be 


200  cycles 
400(N-  1) 

250 

200(N-1) 

600N  - 150  cycles 


This  timing  is  probably  optimistic  due  to  unforeseen  memory  and 


overlap  conflicts. 
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Notes : 

1.  Gaussian  elimination,  taking  advantage  of  the  block  structure,  using 

scalar  operations,  requires  the  same  execution  time  as  the  block  LU 
factorization . 

2.  If  u.,  1 s j sn-1,  is  saved  in  the  factorization,  then  ||B[A  ]||may 
be  computed  following  the  first  loop  at  a cost  of 

( (n- 1)  t 4-  ~ ) (nN)  + (n-  1)  a + c 

max  + max 

using  the  vector  maximum  instruction.  This  yields  13N  4-  166  for  n = 
on  STAR. 
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2.  Odd-even  elimination,  using  vector  operations 

We  give  the  algorithm  and  timing  for  N = 2™  - 1,  deriving  formulae 
solely  in  terms  of  N and  m.  Timing  for  other  values  of  N is  approximated 
by  setting  m = ^og^N"!  . 


i-  lx 


vector 

lengths 

N,  N-r 

N-r 

N-r 

N-2r 

N-2r 


Algorithm:  a ^ ^ = a^,  etc. 

for  i = l,2,...,m  (r  = 2 

for  each  j = 1,2,...,N 

solve  b.^(ry.  y.  c c.)  = (a . ^ c.^  v.^) 
J J J J J J J 

take  a.  = y.  = 0,  cp.  = 0 if  j < 1 or  j > N 

J J J 

(i+1)  _ (i)  (i)  (i) 

b.  -b.  -a.  y.  -c.  a., 

J J J J-r  j j+r 

(i+l)  _ (i)  o (i)  (i) 

v.  = v . -a.  co . - c . cd. 

J J J J-r  J J+r 

(i+l)  _ (i) 

a . = -a . or. 

J J J-r 

(i+l)  _ (i) 

c . = -c . V . , 

J J J+r 

for  each  j = 1,2,...,N 

(nH-1)  _ (riH-1) 

solve  b . x . = v . 

J J J 


Storage  handling:  temporary  vectors  are  needed  for  the  factorization  of 

bj  ^ and  for  , y^ , cd^  computed  at  each  stage,  and  we  can  overwrite 

. (i+l)  . (i)  (i+l)  (i) 

b.  -*  b . , a v -a.  , etc. 

J J J J 

Timing : 

1.  arithmetic  only 

(a)  [r(5n2  + n)T  + -|(38n3  + 3n2  - 5n)  t 

l D b M 

+ -|(38n2  - 9n2  - 5n)Tg}Nm 
+ (4(5n2  + n)aD  + -|(38n2  + 3n2  - 5n)a^ 

+ ^(38n^  - 9n“  - 5n)Cg) } m 


■ {'jPn-  - n)To  + -|(46n3  - 3n“  + 5n) 

1,3  2 i 

+ ”^(46n  - 27n“  + 5n)Tg  } N 

+ [(2n  )t  + (2n3  - 2n“)T  + -^(n'  + n)  a 

M S 2 D 

+ -jr(-10n3  + 3n~  - 5n)<r 
6 >1 

+ ^r(-10n3  + 15n-  - 5n)<rs] 

(b)  with  STAR  data, 

£-(33n  + 19n2  - n)Nm  + -^(8740n3  + 2343n~  - 649n)m 
-^(46n3  + n2  + n)N  - |(2282n3  - 2037n2  + 649n) 

(c)  for  n = 2,  with  STAR  data, 

94.5  Mm  + 12999m  - 93.5N  - 1901. 

2.  With  overwriting  of  a.  - c^1'*,  it  is  necessary 

to  develop  the  products  one  row  at  a time  in  a temporary  vector,  and 

then  use  a transmit  operation  on  the  whole  row.  For  specifics  see 

the  discussion  of  odd-even  reduction  to  follow.  The  added  cost  for 

loop  control  and  transmit  instructions  is 
m- 1 - 

40m  + , = 1 2n(^n(N-21)  + 91)  = 40m  + n"(N(m-2)  4-  1)  + 182n(m-l), 

or  4Nm  4-  404m  - 8N  - 360  for  n = 2. 

(a)  with  STAR  data,  total  time  is 

£•( 38n3  + 23n2  - n)Nm  + |(8740n3  4-  2343n2  4-  443n  4-  240)m 

-7<46n3  + 9n2  + n)N  - 7(2282n3  - 2043n2  + ’74 In) 

4 6 

(b)  for  n = 2,  with  STAR  data,  total  time  is 
98 . 5Nm  4-  13403m  W01.5N  - 2261. 
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Notes : 

1.  ||  B[ A ] 1 1 may  be  computed  in  the  first  time  through  the  loop  at  a cost 

of  ((2n-l)T,  + T ) (nN)  + ((2n-l)a,  + c ).  This  yields  15N  4-  308 
for  n = 2 on  STAR.  ||  B [ H ^ ] j | may  be  computed  at  a cost  of  15(N  - 2m  1) 
+ 303  at  the  appropriate  time  through  the  loop. 
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J 


3.  Odd-even  reduction,  using  vector  operations 

As  with  odd-even  elimination,  the  algorithm  and  timings  are  derived 
for  N = 2m  - 1.  For  general  N take  n = ("log  Nl  in  the  timing  formulae, 
even  though  m = riog9N+ln  is  the  proper  choice.  The  algorithm  is  given  in 
detailed  form,  using  the  data  layout  and  storage  handling  described  in  Sec- 
tion 10. B.  For  the  basic  form,  see  Section  4.D. 


Algorithm:  A ^ ^ = A,  = v,  N*  = 2°' 

? 

set  up  control  vector  z = (1  0 1 0 ...),  length  n“N*  bits 
cost  not  counted  in  timing,  but  = one  instruction  on  STAR 

for  i = 1,  2,  . ...  m-1  (JT  = 2"*’1"1'  - 1,  >h*  = 2nH‘1'1) 

,,  „ ,(i)  (i) 

odd- even  separate  A , 

17  2 

8 compress  operations,  cost  = — (3n  + n)N.*  + 736 

O 1 


(i) 


factor  t>2j+1v  , (0  j <:  Ni+1 


(i) 


factorization  overwrites  b„  . , 

2j+l 

need  a temporary  vector,  length  + 

cost  = 4(n^  - n)  D + r(2n'5  - 3n-  + n)  (M  + S)  , 

i b 


vector  lengths  = N. 


soive  b2j+1(l)(*2j+1(l)  v2j+1 


i+1 

(0 


®2j+l 


(iV 


= (a- 


(i) 


"2  j+1 


(i) 


2j+l 


overwrites  a 


'22+1 

(i) 

2 j+1 


<°  »2j+1(1)>,  (0  u s»1+1) 


, etc . 


need  a temporary  vector,  length 
cost  3 (2n  + 1) (nD  + (n2  - n) (M  + S) ) , 


vector  lengths  = j* 


- >2J<“  - -2J 


<»  v (1)  - c 
2 j-  1 


2 j 


(i) 


(i) 


2 j+1 


, (i  ^ j * N.+1) 


overwrites  b^ 


(i) 


need  a temporary  vector,  length 


3 

cost  = 2n  (M  + S) 


14  7 


for  simplicity,  take  vector  lengths  = N * 


(i) 


(i) 


a„,  cp.'~'-c  v*'  cp 

O A '04—1  • TO  • 


(£>  cp  (i) 

2j  92j+l  * 


(i+1)  (i) 

V =W2j  -“2j  +2j-l 

overwrites  w„  . ^ 

2 J 

need  a temporary  vector,  length 
2 

cost  = 2n  (M  + S) 

for  simplicity,  take  vector  lengths  = N. 

L (i+D  =_*(!)*  (i)  ,7  < , < M . 

j 2 j *2j-l  ’ (2  j Ni+1) 

overwrites  a^^  ^ ’ Pr°dvct  developed  one  row  at  a time: 

for  k = 1,  . . . , n 


for  l = 1,  ....  n 

hj  ■aM.2j<i)  °'n,2j-l<1)’  5-l 


for  r = 2 , , n 


(i) 


CU  ’ Cl}  * *kr.2j 0r«.2].l<i)-  <l  s J * *1+l> 

" ^t.y  (1  s L s n;  1 s j SNi+i*) 

3 . 3 2 

cost  = n M + (n  -n)S,  vector  length  » 

+ n transmits,  vector  length  a nN.+1*. 

need  two  temporary  vectors,  length  = nN.  , and  N 

i+l  i+1 

c.(i+1)  3 -c,  (°  v (i),  (l  s j £ V . i) 

j 2 j v2j+l  * U J Ni+1 


as  for  a 


(i+1) 


j 

solve  Xl(m)  * w^13^  , scalar  mode 

cost  = |(n2  + n)D  + ^(2n3  + 3n2  - 5n)  (M  + S) 


for  i = m-  1,  . . . , 1 

x (i>  = cp  (i)  . _ (i)  _ (i+D 

2 j+1  *2  j+1  *2  j+1  Xj 


overwrites  cd, 


(i) 


(i)  v (i+1) 

T2  j+1  xj+l 


- V0<1,  '"x,.,  , (0  * j s N.  + 1) 


2 j+1 


cost  = 2n  (M  + S) , vector  length  » 
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(i+1) 

Xj  -*  temporary 

cost  = transmit,  vector  length  = nN.+^* 

merge  x2j+l^’  temPorary  “* 

cost  = 3nN.*  + 123 
1 


Timings : 

1.  arithmetic  only, 

(a)  {-j(5n2  + n)TD  + -|(33n3  + 15n2  - 5n) 

+ -^(38n3  + 3n2  - 5n)T  }(N-1) 

+ {^(5n2  + n)aQ  + 'jjOSn3  + 15n2  " 5n) 

+ |(33n3  + 3n2  - 5n)a  }(m-l) 

+ {^(n2  + n) tD  + ^(2n3  + 3n2  - 5n) (t^  + tg) ] 

(b)  with  STAR  data:  ^-(38n3  + 31n^  - n)  (N- 1)  + 

4(3740n3  + 5103n2  - 649n)(m-])  + 
o 

(10n3  + 38n2  - 2n) 


(c)  for  n = 2,  with  STAR  data:  106. 5N  + 14839m  - 14717.5 


2.  with  loop  control  and  storage  manipulation  (compress,  merge,  transmit) 
using  STAR  data, 

(a)  for  general  n,  add  80m  + ^-(55n2  + 43n)(N-l)  + (182n  + 950)  (m-1) 

(b)  for  n = 2,  add  76. 5n  + 1394m  - 1390.5,  so 
total  time  = 183N  + 16233m  - 16108 


L 
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Notes : 

1.  An  estimate  for  ||  B[A]||  may  be  computed  in  the  first  time  through  the 
first  loop  at  a cost  of  ((2n-l)T  + r ) (Nn/2)  + ((2n-l)a  + a ). 

This  yields  7.5N  + 308  for  n = 2 on  STAR.  The  estimate  of  ||  B[A]||  is 
simply  the  maximum  row  sum  taken  over  every  other  block  row.  The  true 
value  of  ||  [A]  ||  could  be  found  by  computing  the  remaining  rows  of  B[A], 
but  this  is  more  expensive  and  defeats  the  purpose  of  odd-even  reduction. 
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4.  p-fold  reduction 


The  algorithm  and  timings  are  given  for  N = p™  - 1,  p > 3.  For  general 


N,  take  m * flog  n!  = m/log  p. 


pNi  = m/log2?,  m = log2N ' ; m = 


log  N+l  is  actually  the 
P 


proper  choice. 


Algorithm:  (see  Section  6. A for  details),  N = d 

l r 


rrrt-1-  i 


for  i = 1,  ....  in- 1 

for  each  k = 1,  ....  Ni+1  + 1, 

LU  process  on 

for  each  k = 1,  ....  JJ.  , 

l+l 

UL  process  on  ^ 
generate  A<'i+*\ 
solve  b^  x^  = 
for  i = a-1,  . . . , 1 


for  each  k = 1,  ....  N..  + 1 


back  substitute  for 


\p-j-  1 5 J 


Timing : 


2 13  2 

1.  (a)  for  general  n,  f(5n  + n)r  +T(2on  + 3n  - 5n)  t , 

+ -j(26n3  - 3n2  - 5n)-rs3  (N-p+1) 

+ [5n  + n)cD  + -j(26n3  + 3n2  - 5n)^ 

+ -|(26n3  - 3n2  - 5n)  cs  } (m- 1)  (p- 1) 

+ !^(n2  + n)tD  + ^(2n2  + 3n2  - 5n)(tM+ts)} 
13  2 

(b)  with  STAR  data,  2(26n  + 21n  - n) (N  - p + 1) 

+ -j(5980n  + 2769n“  - 649n)  (m-l)(p-l) 

+ (10n3  + 38n2  - 2n) 
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(c) 


using  STAR  data,  n = 2:  145N  + (19206m  - 19351) (p- 1)  + 228 


= 145N  + 19206m 


(19351p  - 19579) 


2.  data  manipulation  costs  are  outlined  in  Section  10. B.  We  have  not 

derived  the  complete  timing  formulae,  as  the  method  is  expected  to  be 
too  slow. 


Notes : 

1.  In  the  arithmetic  timing,  for  p s 3,  p-fold  reduction  is  always  cheaper 
than  (p-f-l)-fold  reduction  for  any  n.  Comparison  of  2-fold  (odd-even) 
and  3-fold  reduction  yields  an  N(n)  such  that  if  N s N(n)then  odd-even 
reduction  is  faster.  Although  we  have  not  determined  N(n) , the  block 
LU  factorization  will  probably  be  faster  than  either  reduction  method 
for  N < N (n) . Similar  results  should  hold  when  overhead  costs  are 
included  in  the  time  estimates. 
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5 . Mixed  odd-even  reduction  and  LU  factorization 

Algorithm:  do  k odd-even  reductions,  solve  A('k"rl')  x('K+1^  = w<'k+^  by  the 
LU  factorization,  and  back  substitute. 

Timing:  (based  on  N = 2m  - 1) 

1 • (a)  [j(5n“  + n) 4-  ^(38n3  + 15n2  - 5n) 

+ -|(3Sn3  + 3n2  - 5n)-r  }(N  + 1 - 2m-k) 

1 9 13  9 

+ IjCSn"  + nlo^  + — (38n  + 15n“  - 5n)<7^ 

+ |(33n3  + 3n2  - 5n);r  }k 

+ {— (3n  + n)  tQ  + -|(14n3  + 9n~  - 5n)(t^  + tg)}(2m  k-l) 

- (n2tD  + (2n3  + n2)(tM  + tg)} 

The  timing  difference  between  odd-even  reduction  and  the  mixed  algo- 
rithm will  be  an  exoression  of  the  form  'y  (m-k)  - 3 (2m_k)  - v , and  k 

n n n 

should  be  chosen  to  maximize  this  expression.  For  n = 2 using  STAR 
data,  we  have  -y^  = 14839,  3^  = 905.5,  v = 13023,  k = m-5,  yielding 

(c)  106. 5N  + 14839m  - 46908.5 

2.  For  n = 2 and  STAR  data,  the  polyalgorithm  time  is 

183N  + 16233k  + 417(2  ‘ k)  + 33,  and  the  optimal  choice  is  k = m-6, 
yielding 


(b)  183N  + 16233m  - 70677 


15  3 


6 . Incomplete  reduction 

Algorithm:  do  k odd-even  reductions,  solve  D[A^k+^  ] y 
back  substitute  to  obtain  y = y^  ^ = 


(k+l)  _ (k+1) 

W 


X. 


Timing:  (based  on  N = 2m  - 1) 

1.  (a)  {|(5n2  + n)T  + -r(38n3  + 15n2  - 5n)T 

^ Do  M 

+ x(38n3  + 3n2  - 5n)r  } (N  4-  1 - 2m“k) 

D S 

+ {jCSn-  4-  n)uD  + -|(38n3  4-  15n"  - 5n) 

+ — (38n3  4-  3n“  - 5n)cr„}k 
6 a 

+ '^(n2  + n)To  + |(2n3  + 3n“  - 5n)  (t^  + Ts)}(2m"k  - 1) 

+ {^O-  4-  n)a  + r(2n3  4-  3n2  - 5n)  (e  4-  a ) } 

L Do  Mb 


(b)  general  n,  STAR  data: 

£(38n3  4-  31n2  - n)N  4-  |(8740n3  4-  5103n2  - 649n)  k 
- (9n3  4-  6n2)(2m"k  - 1)  4-  -|(460n3  4-  1191n2  - 649n) 

(c)  using  STAR  data,  n = 2:  106. 5N  4-  14839k  - 96(2m_k)  4-  1287 


2.  (a)  general  n,  STAR  data: 


(b) 


^(38n3  4-  86n2  4-  42n)N  4-  |(8740n3  4-  5103n2  4-  443n  4-  5700)k 
- ^(36n3  4-  79n2  4-  51n)(2m_k  - 1) 

13  2 

4-  — (460n  4-  119 In  - 1651n)  4-  80(k4-l) 

n = 2,  STAR  data: 

183N  4-  16233k  - 176.5(2m'k)  4-  1113.5 


and 
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7 . Jacobi  iteration,  serai- Iteration 

basic  step:  = B[A]y^  + D[A] 

for  each  j = 1,  . . . , N 

, , (i+1)  (i)  (i) 

solve  b.y.  = v.  - a.y.  , - c.y.  , 

J J J J J-l  J J+l 

Timing  (identical  for  both  vector  computer  models) : 

1.  (a)  j(n2  + n)  (tqN  + crQ)  + -|(2n3  + 15n^  - 5n)((TM  + -g)N  + (ct^  + cj) 

1.  (b)  = 2.  (a)  using  STAR  data, 

^(2n3  + 19n2  - n)N  + |(460n3  + 3951n2  - 649n) 

1.  (c)  = 2.  (b)  using  STAR  data,  n = 2:  22. 5N  4-  3031 


Optional  preprocessing,  times  for  n = 2,  STAR  data 


(1)  solve  b.(a.  v.  cp.)  =(a.  c.  v.), 

J J J J J J J 

cost  = 38. 5N  + 4367.  The  basic  step  is  then 

(i+1)  _ (i)  (i) 

y . = co.  - a.y . , - v.y . , , , 

J J J J-l  J J+l 

cost  = 12N  + 1840. 


(2)  factor  b ^ , cost  = 3.5N  + 367.  The  basic  step  is  unchanged  from  the 
original,  but  cost  is  19N  + 2634.  This  is  used  in  the  mixed  incom- 
plete reduction/ Jacobi  algorithm. 


Semi- iteration,  basic  step: 

y^m3  = result  of  Jacobi  step  from  y^ 
- .al  (y(m)  - + y'"-1’ 

Vi  ■ 1/(1  - 

p - p(B[A]) . 


p 


- ^ 

Timing,  n = 2,  using  STAR  data:  26. 5N  + 3408,  or  16N  + 2217  with  the  first 
preprocessing . 

Notes : 

] 

1.  Varga  [V3,  p.  139]  shows  that  for  Jacobi  semi- iteration , 


A conservative  estimate  for  the  number  of  iterations  needed  to  obtain 
||  e^  ||  ^ 1/n ||  e^  ||  is  n = 21ogN  (-  log  (x^-  1) ) • It  follows  that 
the  execution  time  for  an  iterative  solution  is  0(N  log  N)  vs.  0(N) 


for  the  better  direct  methods. 


8 . Incomplete  reduction  plus  Jacobi  iterations 

Algorithm:  do  k odd-even  reductions,  solve  D[A^k+^  ]y(k+^  = w^k+^  f do 

l Jacobi  iterations  to  improve  y(k+l)^  and  back  substitute. 


4 , r , • , (k+1) n (k+1)  (k+1)  , f ‘ , 

As  a result  of  solving  D[A  Jy  = w , we  have  factors  for 

D[A^k+^].  This  means  the  cost  is  (using  STAR  data,  n = 2) 


rn m If  m.  1/ 

1.  (c)  106. 5N  + 14339k  - 96(2  K)  + 1287  + 2(19(2  ) + 2615). 

2.  (b^  133N  + 16233k  - 176.5(2m"k)  + 1113.5 

+ 2(19(2m'k)  + 2615)  . 
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